Methods to Determine Tumor Gene Copy Number by Analysis of Cell-Free DNA

ABSTRACT

Methods are provided herein to improve automatic detection of copy number variation in nucleic acid samples. These methods provide improved approaches for determining baseline copy number of genetic loci within a sample and reduce variation due to features of genetic loci, sample preparation, and probe exhaustion.

CROSS-REFERENCE

This application claims priority to U.S. Provisional Application No. 62/269,051, filed Dec. 17, 2015, which is hereby incorporated by reference in its entirety.

BACKGROUND

Cancer is caused by the accumulation of mutations within an individual's normal cells, at least some of which result in improperly regulated cell division. Such mutations commonly include copy number variations, in which the number of copies of a gene within a tumor genome increases or decreases relative to the subject's noncancerous cells.

Detecting and characterizing copy number variation in tumor cells is used to monitor tumor progression, predict patient outcome, and refine treatment choices. Conventional methods, however, are performed on cellular samples that are often obtained by painful and time-intensive biopsies. Such biopsies also can often only examine a fraction of the tumor cells within a subject, and thus are not always representative of the population of tumor cells. There is a need for simpler, more rapid tests for copy number variation in tumors that do not require cellular biopsies, fluorescent in situ hybridization (FISH), comparative genome hybridization arrays, or quantitative fluorescent polymerase chain reaction (PCR) assays.

A particular challenge in determining copy number variation using sequencing data is that genetic loci will exhibit variance in their depth of coverage for reasons unrelated to true copy number. For example, amplification efficiency, PCR efficiency, and guanine-cytosine content can cause differing depths of coverage even for individual genetic loci present in the sample at the same copy number. Improved methods of removing bias due to such effects are needed to improve copy number detection.

SUMMARY

There exists a considerable need for improved methods to detect copy number variation in tumor cells from samples derived from cell-free bodily fluids. The present invention addresses this need and provides additional advantages. In one aspect, the present disclosure provides a method comprising: (a) obtaining sequencing reads of deoxyribonucleic acid (DNA) molecules of a cell-free bodily fluid sample of a subject; (b) generating from the sequence reads a first data set comprising for each genetic locus in a plurality of genetic loci a quantitative measure related to sequencing read coverage (“read coverage”); (c) correcting the first data set by performing saturation equilibrium correction and probe efficiency correction; (d) determining a baseline read coverage for the first data set, wherein the baseline read coverage relates to saturation equilibrium and probe efficiency; and (e) determining a copy number state for each genetic locus in the plurality of genetic loci relative to the baseline read coverage. In some embodiments, the first data set comprises, for each genetic locus in a plurality of genetic loci, a quantitative measure related to guanine-cytosine content (“GC content”) of the genetic locus. In some embodiments, the method comprises, prior to (c), removing from the first data set genetic loci that are high-variance genetic loci, wherein removing comprises: (i) fitting a model relating the quantitative measures related to guanine-cytosine content and the quantitative measures of sequencing read coverage of the genetic loci; and (ii) removing from the genetic loci at least 10% of the genetic loci, wherein the removing the genetic loci comprises removing genetic loci that most differ from the model, thereby providing the first data set of baselining genetic loci. In some embodiments, the method comprises removing at least 45% of the genetic loci.
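The high-variance filtering step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes a straight-line model of read coverage against GC content (the text does not specify the model form), and the names `fit_line`, `select_baselining_loci`, and `remove_fraction` are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def select_baselining_loci(gc, coverage, remove_fraction=0.10):
    """Return sorted indices of loci kept after dropping the loci
    whose coverage most differs from the fitted GC model.

    Assumption: a linear GC model; remove_fraction is rounded up so
    that at least that fraction of loci is removed.
    """
    intercept, slope = fit_line(gc, coverage)
    residuals = [abs(c - (intercept + slope * g)) for g, c in zip(gc, coverage)]
    n_remove = math.ceil(remove_fraction * len(gc))
    ranked = sorted(range(len(gc)), key=lambda i: residuals[i])
    return sorted(ranked[: len(gc) - n_remove])
```

Setting `remove_fraction=0.45` would correspond to the embodiment in which at least 45% of the genetic loci are removed.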

In some embodiments, performing saturation equilibrium correction comprises transforming the first data set of baselining genetic loci into a saturation corrected data set by: (i) determining for each genetic locus from the first data set of baselining genetic loci a quantitative measure related to the probability that a strand of DNA molecule from the sample derived from the genetic locus is represented within the sequencing reads; (ii) determining a first transformation for the read coverage by relating the read coverage of the first data set of baselining genetic loci to both the GC content of the first data set of baselining genetic loci and the quantitative measure related to the probability that a strand of DNA derived from each locus in the first data set of baselining genetic loci is represented within the sequencing reads; and (iii) applying the first transformation to the read coverage of each genetic locus from the first data set of baselining genetic loci to provide the saturation corrected data set, wherein the saturation corrected data set comprises a first set of transformed read coverages of the first data set of baselining genetic loci.

In some embodiments, determining the first transformation comprises (i) determining a measure related to central tendency of the read coverage of the first data set of baselining genetic loci; (ii) determining a function that fits the measure related to central tendency of the read coverage of the first data set of baselining genetic loci based on the GC content of the genetic locus and the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads; and (iii) for each genetic locus of the first data set of baselining genetic loci, determining a difference between the read coverage predicted by the function and the read coverage, wherein the difference is the transformed read coverage. In some embodiments, the function is a surface approximation. In some embodiments provided herein, the surface approximation is a two-dimensional second degree polynomial.
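As an illustration of the surface approximation described above, the following sketch fits a two-dimensional second-degree polynomial in GC content and strand-detection probability by ordinary least squares and returns, for each locus, the observed read coverage minus the predicted read coverage. The least-squares criterion, the sign convention of the difference, and all function names are assumptions made for illustration only.

```python
def design_row(gc, p):
    """Terms of a two-dimensional second-degree polynomial in (gc, p)."""
    return [1.0, gc, p, gc * gc, gc * p, p * p]


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_surface(gc, p, coverage):
    """Least-squares coefficients via the normal equations (A^T A) x = A^T y."""
    rows = [design_row(g, q) for g, q in zip(gc, p)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, coverage)) for i in range(k)]
    return solve_linear(ata, aty)


def saturation_corrected(gc, p, coverage):
    """Transformed read coverage: observed minus surface-predicted coverage."""
    coeffs = fit_surface(gc, p, coverage)
    predicted = [
        sum(c * t for c, t in zip(coeffs, design_row(g, q)))
        for g, q in zip(gc, p)
    ]
    return [y - y_hat for y, y_hat in zip(coverage, predicted)]
```

The normal equations are used here only to keep the sketch free of external dependencies; a numerically more robust solver would be preferable for large probe panels.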

In some embodiments, performing probe efficiency correction comprises transforming the saturation corrected data set into a probe efficiency corrected data set by: (i) removing from the saturation corrected data set genetic loci that are high-variance genetic loci with respect to the first set of transformed read coverages, thereby providing a second data set of baselining genetic loci; (ii) determining a second transformation for the first set of transformed read coverages related to the probe efficiency of the second data set of baselining genetic loci; and (iii) transforming the first set of transformed read coverages of the second data set of baselining genetic loci with the second transformation, thereby providing the probe efficiency corrected data set, wherein the probe efficiency corrected data set comprises a second set of transformed read coverages of the second data set of baselining genetic loci. In some embodiments, removing from the saturation corrected data set genetic loci that are high-variance genetic loci comprises: (i) fitting a model relating the GC content and the first set of transformed read coverages of the saturation corrected data set; and (ii) removing from the saturation corrected data set at least 10% of the genetic loci, wherein the removing the genetic loci comprises removing genetic loci that most differ from the model, thereby providing the second data set of baselining genetic loci. In some embodiments provided herein, at least 45% of the genetic loci are removed.

In some embodiments, probe efficiency is determined by performing the saturation equilibrium correction on one or more reference samples, wherein the probe efficiency is the transformed read coverage obtained by performing the saturation equilibrium correction. In some embodiments, one or more reference samples are cell-free bodily fluid samples from a subject without cancer. In some embodiments provided herein the one or more reference samples are cell-free bodily fluid samples from a subject with cancer, wherein the corresponding genetic locus has not undergone copy number alteration.

In some embodiments, determining the second transformation comprises (i) fitting the probe efficiency determined for the genetic loci from the one or more reference samples to the first set of read coverages from the second data set of baselining genetic loci; and (ii) dividing the transformed read coverages of each genetic locus of the second data set of baselining genetic loci by a predicted probe efficiency based on the fitting of (i). In some embodiments, the method further comprises: (f) determining a third transformation for the second set of transformed read coverages by relating the transformed read coverages of the second data set of baselining genetic loci to both the GC content of the second data set of baselining genetic loci and the quantitative measure related to the probability that a strand of DNA derived from each locus in the second data set of baselining genetic loci is represented within the sequencing reads; and (g) applying the third transformation to the second set of transformed read coverages to provide a fourth data set, wherein the fourth data set comprises a third set of transformed quantitative read coverages.
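The probe efficiency correction described above can be sketched as a fit followed by a division. The sketch below assumes a straight-line relationship between reference-sample probe efficiency and sample read coverage, which the text does not specify; `probe_efficiency_corrected` and its parameter names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def probe_efficiency_corrected(reference_efficiency, sample_coverage):
    """Divide each locus's coverage by the efficiency-predicted coverage.

    reference_efficiency: per-locus probe efficiency from reference samples.
    sample_coverage: per-locus (saturation corrected) coverage in the sample.
    Assumption: the fit is linear and the predicted values are nonzero.
    """
    intercept, slope = fit_line(reference_efficiency, sample_coverage)
    predicted = [intercept + slope * e for e in reference_efficiency]
    return [c / p for c, p in zip(sample_coverage, predicted)]
```

Under this sketch, loci whose coverage is fully explained by probe efficiency yield corrected values near 1, so residual deviations are candidates for true copy number differences.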

In some embodiments, the DNA of the cell-free bodily fluid sample is enriched for the set of genetic loci using one or more oligonucleotide probes that are complementary to at least a portion of the genetic locus from the set of genetic loci. In some embodiments, the GC content of each genetic locus from the set of genetic loci is a measure related to central tendency of guanine-cytosine content of the one or more oligonucleotide probes that are complementary to at least a portion of the genetic locus from the set of genetic loci. In some embodiments, the read coverage of the genetic locus is a measure related to central tendency of the read coverage of regions of the genetic locus corresponding to the one or more oligonucleotide probes. In some embodiments, the performing saturation equilibrium correction and the performing probe efficiency correction comprises fitting a Langmuir model, wherein the Langmuir model comprises probe efficiency (K) and saturation equilibrium constant (Isat). In some embodiments, K and Isat are determined empirically for each oligonucleotide probe in the one or more oligonucleotide probes. In some embodiments, the performing saturation equilibrium correction and performing probe efficiency correction comprises fitting the read coverages of the genetic loci to the Langmuir model assuming that the genetic loci are present in identical copy number states, thereby providing a baseline read coverage. In some embodiments, the identical copy number states are diploid. In some embodiments the baseline read coverage is a function dependent on the probe efficiency and the saturation equilibrium.
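The passage above names the Langmuir model parameters K and Isat but does not state the equation. The sketch below assumes the standard Langmuir isotherm form, coverage = Isat * K * c / (1 + K * c), with c the copy number of the locus; this functional form, the function names, and the example parameter values are assumptions for illustration, not the disclosed model.

```python
def langmuir_coverage(copy_number, k, i_sat):
    """Expected read coverage under an assumed Langmuir isotherm.

    k: probe efficiency (K); i_sat: saturation level (Isat).
    """
    return i_sat * k * copy_number / (1.0 + k * copy_number)


def baseline_coverage(k, i_sat, baseline_copy_number=2.0):
    """Baseline read coverage for a probe, assuming a diploid locus."""
    return langmuir_coverage(baseline_copy_number, k, i_sat)


def copy_number_from_coverage(coverage, k, i_sat):
    """Invert the assumed isotherm: c = y / (K * (Isat - y))."""
    if not 0.0 <= coverage < i_sat:
        raise ValueError("coverage must lie in [0, i_sat)")
    return coverage / (k * (i_sat - coverage))


# Hypothetical probe parameters, for illustration only.
K_EXAMPLE, ISAT_EXAMPLE = 0.5, 1000.0
diploid_baseline = baseline_coverage(K_EXAMPLE, ISAT_EXAMPLE)
```

Because the assumed curve flattens as coverage approaches Isat, a doubling of copy number produces less than a doubling of coverage near saturation, which is the nonlinearity the saturation correction is meant to address.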

In some embodiments, determining a copy number state comprises comparing the read coverage of the genetic loci to the baseline read coverage. In some embodiments, the cell-free bodily fluid is selected from the group consisting of serum, plasma, urine, and cerebrospinal fluid. In some embodiments, the read coverage is determined by mapping the sequencing reads to a reference genome. In some embodiments, obtaining the sequencing reads comprises ligating adaptors to the DNA molecules from the cell-free bodily fluid from the subject. In some embodiments, the DNA molecules are duplex DNA molecules and the adaptors are ligated to the duplex DNA molecules such that each adaptor differently tags complementary strands of the DNA molecule to provide tagged strands. In some embodiments, determining the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads comprises sorting sequencing reads into paired reads and unpaired reads, wherein (i) each paired read corresponds to sequence reads generated from a first tagged strand and a second differently tagged complementary strand derived from a double-stranded polynucleotide molecule in said set, and (ii) each unpaired read represents a first tagged strand having no second differently tagged complementary strand derived from a double-stranded polynucleotide molecule represented among said sequence reads in said set of sequence reads. In some embodiments, the method further comprises determining quantitative measures of (i) said paired reads and (ii) said unpaired reads that map to each of one or more genetic loci to determine a quantitative measure related to total double-stranded DNA molecules in said sample that map to each of said one or more genetic loci based on said quantitative measure related to paired reads and unpaired reads mapping to each locus. In some embodiments, the adaptors comprise barcode sequences.
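One way to relate paired and unpaired read counts to a strand-detection probability is sketched below. It assumes that each strand of a duplex molecule is detected independently with the same probability p, so that for N molecules the expected paired count is N * p^2 and the expected unpaired count is 2 * N * p * (1 - p). This independence model and the function name are assumptions; the text above does not state the estimator.

```python
def estimate_strand_detection(paired, unpaired):
    """Estimate (p, total_molecules) at a locus from duplex read counts.

    paired: molecules for which both tagged strands were observed.
    unpaired: molecules for which only one tagged strand was observed.
    Solving paired = N p^2 and unpaired = 2 N p (1 - p) gives
    p = 2 * paired / (2 * paired + unpaired) and
    N = (2 * paired + unpaired)^2 / (4 * paired).
    """
    if paired <= 0:
        raise ValueError("at least one paired molecule is required")
    strand_sum = 2.0 * paired + unpaired
    p = 2.0 * paired / strand_sum
    total = strand_sum ** 2 / (4.0 * paired)
    return p, total
```

Under this model the number of molecules with neither strand observed is total - paired - unpaired, which equals unpaired^2 / (4 * paired).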

In some embodiments, determining the read coverage comprises collapsing the sequencing reads based on position of the mapping of the sequencing reads to the reference genome and the barcode sequences. In some embodiments, the genetic loci comprise one or more oncogenes. In some embodiments, a method comprises determining that at least a subset of the baselining genetic loci has undergone copy number alteration in the tumor cells of the subject by determining relative quantities of variants within the baselining genetic loci for which the germline genome of the subject is heterozygous. In some embodiments, the relative quantities of the variants are not approximately equal. In some embodiments, baselining genetic loci for which the relative quantities of the variants are not approximately equal are removed from the baselining genetic loci, thereby providing allelic-frequency corrected baselining genetic loci. In some embodiments, the allelic-frequency corrected baselining genetic loci are used as the baselining loci in any of the methods described herein.
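Collapsing reads by mapping position and barcode, as described above, can be sketched as grouping reads that share the same locus, start, end, and barcode into a single molecule. The tuple layout and the function name `unique_molecular_counts` are hypothetical, and real pipelines typically also tolerate sequencing errors in barcodes, which this sketch does not.

```python
from collections import Counter


def unique_molecular_counts(reads):
    """Count unique molecules per locus.

    reads: iterable of (locus, start, end, barcode) tuples, one per
    mapped sequencing read. Reads with identical mapping coordinates
    and barcode are treated as copies of one original molecule.
    """
    molecules = set(reads)
    return Counter(locus for locus, _start, _end, _barcode in molecules)


# Hypothetical reads, for illustration only.
example_reads = [
    ("ERBB2", 100, 265, "ACGT"),
    ("ERBB2", 100, 265, "ACGT"),  # PCR duplicate of the first read
    ("ERBB2", 100, 265, "TTAG"),  # same position, different barcode
    ("MET", 40, 210, "ACGT"),
]
example_counts = unique_molecular_counts(example_reads)
```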

In another aspect, the present disclosure provides a method comprising: receiving into memory sequencing reads of deoxyribonucleic acid (DNA) molecules of a cell-free bodily fluid sample of a subject; executing code with a computer processor to perform the following steps: generating from the sequence reads a first data set comprising for each genetic locus in a plurality of genetic loci a quantitative measure related to sequencing read coverage (“read coverage”); correcting the first data set by performing saturation equilibrium correction and probe efficiency correction; determining a baseline read coverage for the first data set, wherein the baseline read coverage relates to saturation equilibrium and probe efficiency; and determining a copy number state for each genetic locus in the plurality of genetic loci relative to the baseline read coverage.

In another aspect, the present disclosure provides a system comprising: a network; a database comprising computer memory configured to store nucleic acid (e.g., DNA) sequence data, which database is connected to the network; a bioinformatics computer comprising a computer memory and one or more computer processors, which computer is connected to the network; wherein the computer further comprises machine-executable code which, when executed by the one or more computer processors, copies nucleic acid (e.g., DNA) sequence data stored on the database, writes the copied data to memory in the bioinformatics computer and performs steps including: generating from the nucleic acid (e.g., DNA) sequence data a first data set comprising for each genetic locus in a plurality of genetic loci a quantitative measure related to sequencing read coverage (“read coverage”); correcting the first data set by performing saturation equilibrium correction and probe efficiency correction; determining a baseline read coverage for the first data set, wherein the baseline read coverage relates to saturation equilibrium and probe efficiency; and determining a copy number state for each genetic locus in the plurality of genetic loci relative to the baseline read coverage. In some embodiments, the database is connected to a DNA sequencer.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

FIG. 1 illustrates exemplary oncogenes and targets for sequence capture probes.

FIG. 2 illustrates gene-level signal versus theoretical copy number across three spike-in and probe-level signal variation across spike-in genes.

FIG. 3 illustrates a bait optimization experiment relating bait amount with unique molecular counts.

FIG. 4A and FIG. 4B illustrate the nonlinear effects of p (FIG. 4A) and GC content (FIG. 4B) on unique molecular counts.

FIG. 5 illustrates unique molecular counts per probe without saturation or probe-efficiency correction being performed.

FIG. 6 illustrates post-saturation correction unique molecular counts per probe.

FIG. 7 illustrates post-saturation and post-probe-efficiency corrected unique molecular counts per probe.

FIG. 8 illustrates a proposed Langmuir model of interactions between true copy number and unique molecular counts related to probe saturation and probe efficiency.

FIG. 9 illustrates the probe signal-noise reduction for the baselining genetic loci after saturation correction, probe efficiency correction, and a second round of probe efficiency correction in a typical clinical sample.

FIG. 10A and FIG. 10B illustrate post-saturation corrected unique molecular counts (UMCs) plotted against the probe efficiencies determined in the reference sample in order to perform probe-efficiency correction. FIG. 10A is from a subject without copy number alteration in tumor cells. FIG. 10B is from a subject with copy number alteration in tumor cells.

FIG. 11 illustrates a final report of saturation and probe efficiency corrected copy number variation detection in a patient sample. Stars above a sample indicate gene amplification detected based on the corrected signal and minor-allele frequency corrected baseline optimization.

FIG. 12 illustrates a computer system 1201 that is programmed or otherwise configured to implement methods of the present disclosure.

FIG. 13 illustrates observed copy number (CN) vs. theoretical CN for the gene ERBB2 as measured using a method of the present disclosure. Solid dots represent an observed copy number of ~2 (a diploid sample), open dots represent detected amplification events and the thick horizontal dashed line marks the mean gene CN cutoff.

FIG. 14 illustrates observed copy number (CN) vs. theoretical CN for the gene ERBB2 as measured using a method of the present disclosure (dots) as compared to a control method (squares). Solid dots represent an observed copy number of ~2 (a diploid sample), open dots represent detected amplification events and the thick horizontal dashed line marks the mean gene CN cutoff.

FIG. 15 illustrates probe copy number as plotted against probes used in a validation study for a method of the present disclosure (triangles) vs. a control method (X's).

DETAILED DESCRIPTION

Definitions

The term “genetic variant,” as used herein, generally refers to an alteration, variant or polymorphism in a nucleic acid sample or genome of a subject. Such alteration, variant or polymorphism can be with respect to a reference genome, which may be a reference genome of the subject or other individual. Single nucleotide polymorphisms (SNPs) are a form of polymorphisms. In some examples, one or more polymorphisms comprise one or more single nucleotide variations (SNVs), insertions, deletions, repeats, small insertions, small deletions, small repeats, structural variant junctions, variable length tandem repeats, and/or flanking sequences. Copy number variants (CNVs), transversions and other rearrangements are also forms of genetic variation. A genomic alteration may be a base change, insertion, deletion, repeat, copy number variation, or transversion.

The term “polynucleotide,” as used herein, generally refers to a molecule comprising one or more nucleic acid subunits. A polynucleotide can include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), or variants thereof. A nucleotide can include A, C, G, T or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be an A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T or U, or complementary to a purine (i.e., A or G, or variant thereof) or a pyrimidine (i.e., C, T or U, or variant thereof). A subunit can enable individual nucleic acid bases or groups of bases (e.g., AA, TA, AT, GC, CG, CT, TC, GT, TG, AC, CA, or uracil-counterparts thereof) to be resolved. In some examples, a polynucleotide is deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), or derivatives thereof. A polynucleotide can be single-stranded or double-stranded.

The term “subject,” as used herein, generally refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, the subject can be a vertebrate, a mammal, a mouse, a primate, a simian or a human. Animals include, but are not limited to, farm animals, sport animals, and pets. A subject can be a healthy individual, an individual that has or is suspected of having a disease or a pre-disposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. A subject can be a patient.

The term “genome” generally refers to an entirety of an organism's hereditary information. A genome can be encoded either in DNA or in RNA. A genome can comprise coding regions that code for proteins as well as non-coding regions. A genome can include the sequence of all chromosomes together in an organism. For example, the human genome has a total of 46 chromosomes. The sequence of all of these together constitutes a human genome.

The terms “adaptor(s)”, “adapter(s)” and “tag(s)” are used synonymously throughout this specification. An adaptor or tag can be coupled to a polynucleotide sequence to be “tagged” by any approach including ligation, hybridization, or other approaches.

The term “library adaptor” or “library adapter,” as used herein, generally refers to a molecule (e.g., a polynucleotide) whose identity (e.g., sequence) can be used to differentiate polynucleotides in a biological sample (also “sample” herein).

The term “sequencing adaptor,” as used herein, generally refers to a molecule (e.g., a polynucleotide) that is adapted to permit a sequencing instrument to sequence a target polynucleotide, such as by interacting with the target polynucleotide to enable sequencing. The sequencing adaptor permits the target polynucleotide to be sequenced by the sequencing instrument. In an example, the sequencing adaptor comprises a nucleotide sequence that hybridizes or binds to a capture polynucleotide attached to a solid support of a sequencing system, such as a flow cell. In another example, the sequencing adaptor comprises a nucleotide sequence that hybridizes or binds to a polynucleotide to generate a hairpin loop, which permits the target polynucleotide to be sequenced by a sequencing system. The sequencing adaptor can include a sequencer motif, which can be a nucleotide sequence that is complementary to a flow cell sequence or other molecule (e.g., polynucleotide) and usable by the sequencing system to sequence the target polynucleotide. The sequencer motif can also include a primer sequence for use in sequencing, such as sequencing by synthesis. The sequencer motif can include the sequence(s) needed to couple a library adaptor to a sequencing system and sequence the target polynucleotide.

As used herein, the terms “at least”, “at most” or “about”, when preceding a series, refer to each member of the series, unless otherwise identified.

The term “about” and its grammatical equivalents in relation to a reference numerical value can include a range of values up to plus or minus 10% from that value. For example, the amount “about 10” can include amounts from 9 to 11. In other embodiments, the term “about” in relation to a reference numerical value can include a range of values plus or minus 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, or 1% from that value.

The term “at least” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and greater than that value. For example, the amount “at least 10” can include the value 10 and any numerical value above 10, such as 11, 100, and 1,000.

The term “at most” and its grammatical equivalents in relation to a reference numerical value can include the reference numerical value and less than that value. For example, the amount “at most 10” can include the value 10 and any numerical value under 10, such as 9, 8, 5, 1, 0.5, and 0.1.

The term “quantitative measure” refers to any measure of quantity including absolute and relative measures. A quantitative measure can be, for example, a number (e.g., a count), a percentage, a degree or a threshold.

The term “read coverage” refers to coverage by raw sequence reads or by processed sequence reads, such as unique molecular counts inferred from raw sequence reads.

The term “baseline read coverage” refers to expected read coverage of a probe in a sample comprising a diploid genome environment based on given probe parameters, such as GC content, probe efficiency, ligation efficiency, or pull down efficiency.

“Probe”, as used herein, refers to a polynucleotide comprising a functionality. The functionality can be a detectable label (e.g., a fluorescent label), a binding moiety (e.g., biotin), or a solid support (e.g., a magnetically attractable particle or a chip).

“Complementarity” refers to the ability of a nucleic acid to form hydrogen bond(s) with another nucleic acid sequence by either traditional Watson-Crick or other non-traditional types. A percent complementarity indicates the percentage of residues in a nucleic acid molecule which can form hydrogen bonds (Watson-Crick base pairing) with a second nucleic acid sequence (5, 6, 7, 8, 9, 10 out of 10 being 50%, 60%, 70%, 80%, 90%, and 100% complementary, respectively). “Perfectly complementary” means that all the contiguous residues of a nucleic acid sequence will hydrogen bond with the same number of contiguous residues in a second nucleic acid sequence.
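The percent complementarity arithmetic above (for example, 9 of 10 residues pairing gives 90%) can be sketched as follows. The sketch assumes two equal-length DNA sequences written 5' to 3', compared in antiparallel orientation with Watson-Crick pairing only; it performs no alignment, and the function name is hypothetical.

```python
# Watson-Crick partners for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def percent_complementarity(first, second):
    """Percentage of residues in `first` that pair with `second`.

    Both sequences are given 5'->3'; `second` is reversed so that the
    strands are compared in antiparallel orientation.
    """
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    second_rev = second.upper()[::-1]
    matches = sum(
        COMPLEMENT.get(a) == b for a, b in zip(first.upper(), second_rev)
    )
    return 100.0 * matches / len(first)
```

For sequences of unequal length or with gaps, one of the alignment algorithms named in the following paragraph would be applied first.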

“Substantially complementary” as used herein refers to a degree of complementarity that is at least 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98%, 99%, or 100% over a region of 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, or more nucleotides, or refers to two nucleic acids that hybridize under stringent conditions. Sequence identity, such as for the purpose of assessing percent complementarity, may be measured by any suitable alignment algorithm, including but not limited to the Needleman-Wunsch algorithm (see e.g. the EMBOSS Needle aligner available at the world wide web site: ebi.ac.uk/Tools/psa/emboss_needle/nucleotide.html, optionally with default settings), the BLAST algorithm (see e.g. the BLAST alignment tool available at blast.ncbi.nlm.nih.gov/Blast.cgi, optionally with default settings), or the Smith-Waterman algorithm (see e.g. the EMBOSS Water aligner available at the world wide web site: ebi.ac.uk/Tools/psa/emboss_water/nucleotide.html, optionally with default settings). Optimal alignment may be assessed using any suitable parameters of a chosen algorithm, including default parameters.

“Hybridization” refers to a reaction in which one or more polynucleotides react to form a complex that is stabilized via hydrogen bonding between the bases of the nucleotide residues. The hydrogen bonding may occur by Watson-Crick base pairing, Hoogsteen binding, or in any other sequence specific manner according to base complementarity. The complex may comprise two strands forming a duplex structure, three or more strands forming a multi-stranded complex, a single self-hybridizing strand, or any combination of these. A hybridization reaction may constitute a step in a more extensive process, such as the initiation of PCR, or the enzymatic cleavage of a polynucleotide by an endonuclease. A second sequence that is complementary to a first sequence is referred to as the “complement” of the first sequence. The term “hybridizable” as applied to a polynucleotide refers to the ability of the polynucleotide to form a complex that is stabilized via hydrogen bonding between the bases of the nucleotide residues in a hybridization reaction.

The term “stringent hybridization conditions” refers to conditions under which a polynucleotide will hybridize preferentially to its target subsequence, and to a lesser extent to, or not at all to, other sequences. Stringent hybridization conditions in the context of nucleic acid hybridization experiments are sequence dependent, and are different under different environmental parameters. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology—Hybridization with Nucleic Acid Probes part I chapter 2 “Overview of principles of hybridization and the strategy of nucleic acid probe assays”, Elsevier, New York.

Generally, highly stringent hybridization and wash conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched probe. Very stringent conditions are selected to be equal to the Tm for a particular probe.

Stringent hybridization conditions can include a solution comprising water, a buffer (e.g., a phosphate, Tris, SSPE or SSC buffer at pH 6-9 or pH 7-8), a salt (e.g., sodium or potassium), and a denaturant (e.g., SDS, formamide or Tween), and a temperature of 37° C.-70° C. or 60° C.-65° C.

An example of stringent hybridization conditions for hybridization of complementary nucleic acids which have more than 100 complementary residues on a filter in a Southern or northern blot is 50% formamide with 1 mg of heparin at 42° C., with the hybridization being carried out overnight. An example of highly stringent wash conditions is 0.15 M NaCl at 72° C. for about 15 minutes. An example of stringent wash conditions is a 0.2×SSC wash at 65° C. for 15 minutes (see, Sambrook et al. for a description of SSC buffer). Often, a high stringency wash is preceded by a low stringency wash to remove background probe signal. An example medium stringency wash for a duplex of, e.g., more than 100 nucleotides, is 1×SSC at 45° C. for 15 minutes. An example low stringency wash for a duplex of, e.g., more than 100 nucleotides, is 4-6×SSC at 40° C. for 15 minutes. In general, a signal to noise ratio of 2× (or higher) than that observed for an unrelated probe in the particular hybridization assay indicates detection of a specific hybridization.

In one aspect, the present disclosure provides a method comprising: (a) obtaining sequencing reads derived from deoxyribonucleic acid (DNA) molecules of a cell-free bodily fluid sample of a subject; (b) generating a first data set, the first data set comprising, for each genetic locus in a plurality of genetic loci, (i) a quantitative measure related to guanine-cytosine content of the genetic locus and (ii) a quantitative measure related to sequencing read coverage of the genetic locus from the sequencing reads; (c) transforming the first data set into a second data set by: (i) removing from the first data set genetic loci that are high-variance genetic loci with respect to the quantitative measure related to sequencing read coverage, thereby providing a first set of remaining genetic loci; (ii) determining for each genetic locus from the first set of remaining genetic loci a quantitative measure related to the probability that a strand of DNA from the sample derived from the genetic locus is represented within the sequencing reads; (iii) determining a first transformation for the quantitative measure related to sequencing read coverage by relating the quantitative measure related to sequencing read coverage of the first set of remaining genetic loci to both the quantitative measure related to the GC content of the first set of remaining genetic loci and the quantitative measure related to the probability that a strand of DNA derived from each locus in the first set of remaining genetic loci is represented within the sequencing reads; and (iv) applying the first transformation to the sequence read coverage of each genetic locus from the first set of remaining genetic loci to provide the second data set, wherein the second data set comprises a first set of transformed quantitative measures of sequencing read coverage of the first set of remaining genetic loci.

In some embodiments, the method further comprises transforming the second data set into a third data set by: (d) removing from the second data set genetic loci that are high-variance genetic loci with respect to the first set of transformed quantitative measures of sequencing read coverage, thereby providing a second set of remaining genetic loci; (e) determining a second transformation for the first set of transformed quantitative measures of sequencing read coverage related to the efficiency of the second set of remaining genetic loci; and (f) transforming the first set of transformed quantitative measures of sequencing read coverage of the second set of remaining genetic loci with the second transformation, thereby providing the third data set, wherein the third data set comprises a second set of transformed quantitative measures related to sequencing read coverage of the second set of remaining genetic loci of (d).

Obtaining Sequencing Reads from DNA Molecules of a Cell-Free Bodily Fluid from a Subject

Obtaining sequencing reads from DNA molecules of a cell-free bodily fluid of a subject can comprise obtaining a cell-free bodily fluid. Exemplary cell-free bodily fluids are or can be derived from serum, plasma, blood, saliva, urine, synovial fluid, whole blood, lymphatic fluid, ascites fluid, interstitial or extracellular fluid, the fluid in spaces between cells, including gingival crevicular fluid, bone marrow, cerebrospinal fluid, mucus, sputum, semen, sweat, or any other bodily fluids. A cell-free bodily fluid can be selected from the group consisting of plasma, urine, and cerebrospinal fluid. A cell-free bodily fluid can be plasma. A cell-free bodily fluid can be urine. A cell-free bodily fluid can be cerebrospinal fluid.

Nucleic acid molecules, including DNA molecules, can be extracted from cell-free bodily fluids. DNA molecules can be genomic DNA. DNA molecules can be from cells of healthy tissue of the subject. DNA molecules can be from noncancerous cells that have undergone somatic mutation. DNA molecules can be from a fetus in a maternal sample. The skilled worker will understand that, in embodiments wherein the DNA molecules are from a fetus in a maternal sample, a subject may refer to the fetus even though the sample is maternal. DNA molecules can be from precancerous cells of the subject. DNA molecules can be from cancerous cells of the subject. DNA molecules can be from cells within primary tumors of the subject. DNA molecules can be from secondary tumors of the subject. DNA molecules can be circulating DNA. The circulating DNA can comprise circulating tumor DNA (ctDNA). DNA molecules can be double-stranded or single-stranded. Alternatively, a DNA molecule can comprise a combination of a double-stranded portion and a single-stranded portion. DNA molecules do not have to be cell-free. In some cases, the DNA molecules can be isolated from a sample. For example, DNA molecules can be cell-free DNA isolated from a bodily fluid, e.g., serum or plasma.

A sample can comprise various amounts of genome equivalents of nucleic acid molecules. For example, a sample of about 30 ng DNA can contain about 10,000 haploid human genome equivalents and, in the case of cfDNA, about 200 billion individual polynucleotide molecules. Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents and, in the case of cfDNA, about 600 billion individual molecules.
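The orders of magnitude quoted above can be checked with a short calculation. The following is a minimal sketch in Python, not part of the disclosed method. It assumes a haploid human genome mass of about 3.3 pg, an average mass of about 650 g/mol per base pair, and a typical cfDNA fragment length of about 170 base pairs; none of these constants is stated in the text, and the function names are illustrative.

```python
# Back-of-the-envelope estimate only. All constants are assumptions
# introduced for illustration; they are not values given in the text.
HAPLOID_GENOME_PG = 3.3      # assumed mass of one haploid human genome (picograms)
BP_MOLAR_MASS = 650.0        # assumed average mass of one base pair (g/mol)
MEAN_FRAGMENT_BP = 170       # assumed typical cfDNA fragment length (base pairs)
AVOGADRO = 6.022e23          # molecules per mole


def haploid_genome_equivalents(mass_ng: float) -> float:
    """Approximate number of haploid genome copies in mass_ng of DNA."""
    return mass_ng * 1000.0 / HAPLOID_GENOME_PG  # 1 ng = 1000 pg


def fragment_count(mass_ng: float, mean_fragment_bp: int = MEAN_FRAGMENT_BP) -> float:
    """Approximate number of double-stranded fragments in mass_ng of DNA."""
    grams = mass_ng * 1e-9
    moles = grams / (mean_fragment_bp * BP_MOLAR_MASS)
    return moles * AVOGADRO


if __name__ == "__main__":
    for mass in (30.0, 100.0):
        print(mass, "ng:",
              round(haploid_genome_equivalents(mass)), "genome equivalents,",
              f"{fragment_count(mass):.2e} fragments")
```

Under these assumptions, 30 ng corresponds to roughly 9,000 haploid genome equivalents and on the order of 10^11 fragments, which is consistent with the rounded figures given above.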

Cell-free DNA molecules may be isolated and extracted from bodily fluids using a variety of techniques known in the art. In some cases, cell-free nucleic acids may be isolated, extracted and prepared using commercially available kits such as the Qiagen QIAamp® Circulating Nucleic Acid Kit protocol. In other examples, the Qubit™ dsDNA HS Assay kit protocol, Agilent™ DNA 1000 kit, or TruSeq™ Sequencing Library Preparation; Low-Throughput (LT) protocol may be used to quantify nucleic acids. Cell-free nucleic acids may be fetal in origin (via fluid taken from a pregnant subject), or may be derived from tissue of the subject itself. Cell-free nucleic acids can be derived from a neoplasm (e.g., a tumor or an adenoma).

Generally, cell-free nucleic acids are extracted and isolated from bodily fluids through a partitioning step in which cell-free nucleic acids, as found in solution, are separated from cells and other non-soluble components of the bodily fluid. Partitioning may include, but is not limited to, techniques such as centrifugation or filtration. In other cases, cells are not partitioned from cell-free nucleic acids first, but rather lysed. In one example, the genomic DNA of intact cells is partitioned through selective precipitation. Cell-free nucleic acids, including DNA, may remain soluble and may be separated from insoluble genomic DNA and extracted. Generally, after addition of buffers and other wash steps specific to different kits, nucleic acids may be precipitated using isopropanol precipitation. Further clean up steps may be used such as silica based columns to remove contaminants or salts. General steps may be optimized for specific applications. Non-specific bulk carrier nucleic acids, for example, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.

Cell-free DNA molecules can be at most 500 nucleotides in length, at most 400 nucleotides in length, at most 300 nucleotides in length, at most 250 nucleotides in length, at most 225 nucleotides in length, at most 200 nucleotides in length, at most 190 nucleotides in length, at most 180 nucleotides in length, at most 170 nucleotides in length, at most 160 nucleotides in length, at most 150 nucleotides in length, at most 140 nucleotides in length, at most 130 nucleotides in length, at most 120 nucleotides in length, at most 110 nucleotides in length, or at most 100 nucleotides in length.

Cell-free DNA molecules can be at least 500 nucleotides in length, at least 400 nucleotides in length, at least 300 nucleotides in length, at least 250 nucleotides in length, at least 225 nucleotides in length, at least 200 nucleotides in length, at least 190 nucleotides in length, at least 180 nucleotides in length, at least 170 nucleotides in length, at least 160 nucleotides in length, at least 150 nucleotides in length, at least 140 nucleotides in length, at least 130 nucleotides in length, at least 120 nucleotides in length, at least 110 nucleotides in length, or at least 100 nucleotides in length. In particular, cell-free nucleic acids can be between 140 and 180 nucleotides in length.

Cell-free DNA can comprise DNA molecules from healthy tissue and tumors in various amounts. Tumor-derived cell-free DNA can be at least 0.1% of the total amount of cell-free DNA in the sample, at least 0.2% of the total amount of cell-free DNA in the sample, at least 0.5% of the total amount of cell-free DNA in the sample, at least 0.7% of the total amount of cell-free DNA in the sample, at least 1% of the total amount of cell-free DNA in the sample, at least 2% of the total amount of cell-free DNA in the sample, at least 3% of the total amount of cell-free DNA in the sample, at least 4% of the total amount of cell-free DNA in the sample, at least 5% of the total amount of cell-free DNA in the sample, at least 10% of the total amount of cell-free DNA in the sample, at least 15% of the total amount of cell-free DNA in the sample, at least 20% of the total amount of cell-free DNA in the sample, at least 25% of the total amount of cell-free DNA in the sample, or at least 30% of the total amount of cell-free DNA in the sample, or more.

In some cases, DNA molecules can be sheared during the extraction process and comprise fragments between 100 and 400 nucleotides in length. In some cases, nucleic acids can be sheared after extraction and can comprise fragments between 100 and 400 nucleotides in length. In some cases, DNA molecules are already between 100 and 400 nucleotides in length and additional shearing is not purposefully implemented.

A subject can be an animal. A subject can be a mammal, such as a dog, horse, cat, mouse, rat, or human. A subject can be a human. A subject can be suspected of having cancer. A subject can have previously received a cancer diagnosis. The cancer status of a subject may be unknown. A subject can be male or female. A subject can be at least 20 years old, at least 30 years old, at least 40 years old, at least 50 years old, at least 60 years old, or at least 70 years old.

Sequencing may be by any method known in the art. For example, sequencing techniques include classic techniques (e.g., dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary) and next generation techniques. Exemplary techniques include sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, Illumina/Solexa sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, SOLiD sequencing, targeted sequencing, single molecule real-time sequencing, exon sequencing, electron microscopy-based sequencing, panel sequencing, transistor-mediated sequencing, direct sequencing, random shotgun sequencing, whole-genome sequencing, sequencing by hybridization, capillary electrophoresis, gel electrophoresis, duplex sequencing, cycle sequencing, single-base extension sequencing, solid-phase sequencing, high-throughput sequencing, massively parallel signature sequencing, emulsion PCR, co-amplification at lower denaturation temperature-PCR (COLD-PCR), multiplex PCR, sequencing by reversible dye terminator, paired-end sequencing, near-term sequencing, exonuclease sequencing, sequencing by ligation, short-read sequencing, single-molecule sequencing, real-time sequencing, reverse-terminator sequencing, nanopore sequencing, MS-PET sequencing, and a combination thereof. In some embodiments, the sequencing method is massively parallel sequencing, that is, simultaneously (or in rapid succession) sequencing any of at least 100, 1000, 10,000, 100,000, 1 million, 10 million, 100 million, or 1 billion polynucleotide molecules. In some embodiments, sequencing can be performed by a gene analyzer such as, for example, gene analyzers commercially available from Illumina or Applied Biosystems. Sequencing of separated molecules has more recently been demonstrated by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes. Sequencing may be performed by a DNA sequencer (e.g., a machine designed to perform sequencing reactions). In some embodiments, a DNA sequencer can comprise or be connected to a database, for example, that contains DNA sequence data.

A sequencing technique that can be used includes, for example, use of sequencing-by-synthesis systems. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed.

Another example of a DNA sequencing technique that can be used is SOLiD technology by Applied Biosystems from Life Technologies Corporation (Carlsbad, Calif.). In SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is removed and the process is then repeated.

Another example of a DNA sequencing technique that can be used is ion semiconductor sequencing using, for example, a system sold under the trademark ION TORRENT by Ion Torrent by Life Technologies (South San Francisco, Calif.). Ion semiconductor sequencing is described, for example, in Rothberg, et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature 475:348-352 (2011); U.S. Pub. 2010/0304982; U.S. Pub. 2010/0301398; U.S. Pub. 2010/0300895; U.S. Pub. 2010/0300559; and U.S. Pub. 2009/0026082, the contents of each of which are incorporated by reference in their entirety.

Another example of a sequencing technology that can be used is Illumina sequencing. Illumina sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5′ and 3′ ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 copies of single-stranded DNA molecules of the same template in each channel of the flow cell. Primers, DNA polymerase and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3′ terminators and fluorophores from each incorporated base are removed and the incorporation, detection and identification steps are repeated. Sequencing according to this technology is described in U.S. Pat. No. 7,960,120; U.S. Pat. No. 7,835,871; U.S. Pat. No. 7,232,656; U.S. Pat. No. 7,598,035; U.S. Pat. No. 6,911,345; U.S. Pat. No. 6,833,246; U.S. Pat. No. 6,828,100; U.S. Pat. No. 6,306,597; U.S. Pat. No. 6,210,891; U.S. Pub. 2011/0009278; U.S. Pub. 2007/0114362; U.S. Pub. 2006/0292611; and U.S. Pub. 2006/0024681, each of which is incorporated by reference in its entirety.

Another example of a sequencing technology that can be used includes the single molecule, real-time (SMRT) technology of Pacific Biosciences (Menlo Park, Calif.). In SMRT, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated.

Another example of a sequencing technique that can be used is nanopore sequencing (Soni & Meller, 2007, Progress toward ultrafast DNA sequencing using solid-state nanopores, Clin Chem 53(11):1996-2001). A nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore represents a reading of the DNA sequence.

Another example of a sequencing technique that can be used involves using a chemical-sensitive field effect transistor (chemFET) array to sequence DNA (for example, as described in U.S. Pub. 2009/0026082). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the nucleic acids can be amplified on the bead, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

Another example of a sequencing technique that can be used involves using an electron microscope as described, for example, by Moudrianakis, E. N. and Beer M., in Base sequence determination in nucleic acids with the electron microscope, III. Chemistry and microscopy of guanine-labeled DNA, PNAS 53:564-71 (1965). In one example of the technique, individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to measure sequences.

Prior to sequencing, adaptor sequences can be attached to the nucleic acid molecules and the nucleic acids can be enriched for particular sequences of interest. Sequence enrichment can occur before or after the attachment of adaptor sequences.

The nucleic acid molecules or enriched nucleic acid molecules can be attached to any sequencing adaptor suitable for use on any sequencing platform disclosed herein. For example, a sequence adaptor can comprise a flow cell sequence, a sample barcode, or both. In another example, a sequence adaptor can be a hairpin shaped adaptor, a Y-shaped adaptor, a forked adaptor, and/or comprise a sample barcode. In some cases, the adaptor does not comprise a sequencing primer region. In some cases the adaptor-attached DNA molecules are amplified, and the amplification products are enriched for specific sequences as described herein. In some cases, the DNA molecules are enriched for specific sequences after preparing a sequencing library. Adaptors can comprise barcode sequences. The barcodes can be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, or more (or any length as described throughout) nucleic acid bases in length, e.g., 7 bases. The barcodes can be random sequences, degenerate sequences, semi-degenerate sequences, or defined sequences. In some cases, there is a sufficient diversity of barcodes that substantially (e.g., at least 70%, at least 80%, at least 90%, or at least 99% of) each nucleic acid molecule is tagged with a different barcode sequence. In some cases, there is a sufficient diversity of barcodes that substantially (e.g., at least 70%, at least 80%, at least 90%, or at least 99% of) each nucleic acid molecule from a particular genetic locus is tagged with a different barcode sequence.

A sequencing adaptor can comprise a sequence capable of hybridizing to one or more sequencing primers. A sequencing adaptor can further comprise a sequence hybridizing to a solid support, e.g., a flow cell sequence. For example, a sequencing adaptor can be a flow cell adaptor. The sequencing adaptors can be attached to one or both ends of a polynucleotide fragment. In another example, a sequencing adaptor can be hairpin shaped. For example, the hairpin shaped adaptor can comprise a complementary double-stranded portion and a loop portion, where the double-stranded portion can be attached (e.g., ligated) to a double-stranded polynucleotide. Hairpin shaped sequencing adaptors can be attached to both ends of a polynucleotide fragment to generate a circular molecule, which can be sequenced multiple times.

In some cases, none of the library adaptors contains a sample identification motif (or sample molecular barcode). Such sample identification motif can be provided via sequencing adaptors. A sample identification motif can include a sequence of at least 4, 5, 6, 7, 8, 9, 10, 20, 30, or 40 nucleotide bases that permits the identification of polynucleotide molecules from a given sample from polynucleotide molecules from other samples. For example, this can permit polynucleotide molecules from two subjects to be sequenced in the same pool and sequence reads for the subjects subsequently identified.
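The pooling scenario above can be illustrated with a minimal Python sketch of how reads might be assigned back to subjects by their sample identification motif. The input layout (pairs of motif sequence and read sequence), the function names, the example motifs, and the one-mismatch tolerance are all assumptions made for illustration; the text does not prescribe a matching rule.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_motifs, max_mismatches=1):
    """Group reads by sample using the sample identification motif.

    reads: iterable of (motif_seq, read_seq) pairs (hypothetical layout).
    sample_motifs: dict mapping sample name -> expected motif sequence.
    Reads whose motif matches no sample, or more than one sample,
    within max_mismatches are placed under 'unassigned'.
    """
    assigned = {name: [] for name in sample_motifs}
    assigned["unassigned"] = []
    for motif_seq, read_seq in reads:
        hits = [
            name
            for name, motif in sample_motifs.items()
            if len(motif) == len(motif_seq)
            and hamming(motif, motif_seq) <= max_mismatches
        ]
        target = hits[0] if len(hits) == 1 else "unassigned"
        assigned[target].append(read_seq)
    return assigned


if __name__ == "__main__":
    motifs = {"subject_1": "ACGTAC", "subject_2": "TGCATG"}  # illustrative motifs
    pooled = [("ACGTAC", "AAAA"), ("TGCATC", "CCCC"), ("GGGGGG", "TTTT")]
    print(demultiplex(pooled, motifs))
```

Tolerating a small number of mismatches is a design choice that trades a little specificity for robustness to sequencing errors in the motif itself; it only works when the motifs differ from one another at more positions than the tolerance allows.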

A sequencer motif includes nucleotide sequence(s) needed to couple a library adaptor to a sequencing system and sequence a target polynucleotide coupled to the library adaptor. The sequencer motif can include a sequence that is complementary to a flow cell sequence and a sequence (sequencing initiation sequence) that can be selectively hybridized to a primer (or priming sequence) for use in sequencing. For example, such sequencing initiation sequence can be complementary to a primer that is employed for use in sequencing by synthesis (e.g., Illumina). Such primer can be included in a sequencing adaptor. A sequencing initiation sequence can be a primer hybridization site.

In some cases, none of the library adaptors contains a complete sequencer motif. The library adaptors can contain partial or no sequencer motifs. In some cases, the library adaptors include a sequencing initiation sequence. The library adaptors can include a sequencing initiation sequence but no flow cell sequence. The sequencing initiation sequence can be complementary to a primer for sequencing. The primer can be a sequence specific primer or a universal primer. Such sequencing initiation sequences may be situated on single-stranded portions of the library adaptors. As an alternative, such sequencing initiation sequences may be priming sites (e.g., kinks or nicks) to permit a polymerase to couple to the library adaptors during sequencing.

Adaptors can be attached to DNA molecules by ligation. In some cases, the adaptors are ligated to duplex DNA molecules such that each adaptor differently tags complementary strands of the DNA molecule. In some cases, adaptor sequences can be attached by PCR, wherein a first portion of a single-stranded DNA is complementary to a target sequence and a second portion comprises the adaptor sequence.

Enrichment for particular sequences of interest can be performed by sequence capture methods. Sequence capture can be performed using immobilized probes that hybridize to the targets of interest. Sequence capture can be performed using probes attached to functional groups, e.g., biotin, that allow probes hybridized to specific sequences to be enriched for from a sample by pulldown. In some cases, prior to hybridization to functionalized probes, specific sequences such as adaptor sequences from library fragments can be masked by annealing complementary, non-functionalized polynucleotide sequences to the fragments in order to reduce non-specific or off-target binding. Sequence probes can target specific genes. Sequence capture probes can target specific genetic loci or genes. Such genes can be oncogenes. Exemplary genes targeted by capture probes include those shown in FIG. 1. Exemplary genes with point mutations (SNVs) include, but are not limited to, AKT1, ATM, CCNE1, CTNNB1, FGFR1, GNAS, JAK3, MLH1, NPM1, PTPN11, RIT1, TERT, ALK, BRAF, CDH1, EGFR, FGFR2, HNF1A, KIT, MPL, NRAS, RAF1, ROS1, TP53, APC, BRCA1, CDK4, ERBB2, FGFR3, HRAS, KRAS, MYC, NTRK1, RB1, SMAD4, TSC1, AR, BRCA2, CDK6, ESR1, GATA3, IDH2, MAP2K2, NFE2L2, PIK3CA, RHEB, SRC, ARID1A, CCND2, CDKN2B, FBXW7, GNAQ, JAK2, MET, NOTCH1, PTEN, RHOA, and STK11. Exemplary genes with copy number variations include, but are not limited to, AR, CCNE1, CDK6, ERBB2, FGFR2, KRAS, MYC, PIK3CA, BRAF, CDK4, EGFR, FGFR1, KIT, MET, PDGFRA, and RAF1. Exemplary genes with gene fusions include, but are not limited to: ALK, FGFR2, FGFR3, NTRK1, RET, and ROS1. Exemplary genes with indels include, but are not limited to: EGFR (for example, at exons 19 and 20), ERBB2 (for example, at exons 19 and 20), and MET (for example, skipping exon 14). Exemplary targets can include CCND1 and CCND2. Sequence capture probes can tile across a gene (e.g., probes can target overlapping regions). Sequence probes can target non-overlapping regions. Sequence probes can be optimized for length, melting temperature, and secondary structure.

Quantitative Measures of Guanine-Cytosine (GC) Content

Guanine-cytosine content is the percentage of nitrogenous bases of a DNA molecule that are either guanine or cytosine. A quantitative measure related to GC content for a genetic locus can be the GC content of the entire genetic locus. A quantitative measure related to GC content for a genetic locus can be the GC content of the exonic regions of the gene. A quantitative measure related to GC content for a genetic locus can be the GC content of the regions covered by reads mapping to the genetic locus. A quantitative measure related to GC content can be the GC content of the sequence capture probes corresponding to the genetic locus. A quantitative measure related to GC content for a genetic locus can be a measure related to central tendency of the GC content of the sequence capture probes corresponding to the genetic locus. The measure related to central tendency can be any measure of central tendency such as mean, median, or mode. The measure related to central tendency can be the median. GC content of a given region can be measured by dividing the number of guanine and cytosine bases by the total number of bases over that region.
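As an illustration of the last two options, the following minimal Python sketch computes the GC content of a region as the number of guanine and cytosine bases divided by the total number of bases, and then summarizes a genetic locus by the median GC content of its capture probes. The function names and the example probe sequences are hypothetical.

```python
from statistics import median


def gc_fraction(seq: str) -> float:
    """GC content of a region: (number of G and C bases) / (total bases)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def locus_gc(probe_seqs) -> float:
    """Median GC content over the capture probes of one genetic locus."""
    return median(gc_fraction(s) for s in probe_seqs)


if __name__ == "__main__":
    probes = ["GGCC", "AATT", "GCAT"]  # toy probe sequences, for illustration only
    print([gc_fraction(p) for p in probes], locus_gc(probes))
```

Multiplying the returned fraction by 100 gives the percentage form used in the definition above.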

Quantitative Measures of Sequencing Read Coverage

A quantitative measure related to sequencing read coverage is a measure indicative of the number of reads derived from a DNA molecule corresponding to a genetic locus (e.g., a particular position, base, region, gene or chromosome from a reference genome). In order to associate reads to a genetic locus, the reads can be mapped or aligned to the reference. Software to perform mapping or aligning (e.g., Bowtie, BWA, mrsFAST, BLAST, BLAT) can associate a sequencing read with a genetic locus. During the mapping process, particular parameters can be optimized. Non-limiting examples of optimization of the mapping process can include masking repetitive regions; employing mapping quality (e.g., MAPQ) score cut-offs; using different seed lengths to generate alignments; and limiting the edit distance between positions of the genome.

Quantitative measures associated with sequencing read coverage can include counts of reads associated with a genetic locus. In some cases, the counts are transformed into new metrics to mitigate the effects of differing sequencing depth, library complexity, or size of the genetic locus. Exemplary metrics are Reads Per Kilobase per Million (RPKM), Fragments Per Kilobase per Million (FPKM), Trimmed Mean of M values (TMM), variance stabilized raw counts, and log transformed raw counts. Other transformations are also known to those of skill in the art that may be used for particular applications.
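Two of the listed metrics are simple enough to sketch directly. The Python below shows the conventional RPKM formula (reads mapped to the locus, scaled per kilobase of locus length and per million mapped reads) and a log transform of raw counts. The function names and the pseudocount of 1 are assumptions for illustration, not choices made in the text.

```python
import math


def rpkm(read_count: int, locus_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads for one genetic locus."""
    if locus_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (locus_length_bp * total_mapped_reads)


def log2_count(read_count: int, pseudocount: float = 1.0) -> float:
    """Log-transformed raw count; the pseudocount (an assumption) avoids log(0)."""
    return math.log2(read_count + pseudocount)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp locus in a library of 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
    print(log2_count(7))
```

Dividing by locus length corrects for the size of the genetic locus, and dividing by the total mapped read count corrects for sequencing depth, which are two of the effects named above.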

Quantitative measures can be determined using collapsed reads, wherein each collapsed read corresponds to an initial template DNA molecule. Methods to collapse and quantify read families are found in PCT/US2013/058061 and PCT/US2014/000048, each of which is herein incorporated by reference in its entirety. In particular, collapsing methods can be employed that use barcodes and sequence information from the sequencing read to collapse reads into families, such that each family shares barcode sequences and at least a portion of the sequencing read sequence. Each family is then, for the majority of the families, derived from a single initial template DNA molecule. Counts derived from mapping sequences from families can be referred to as “unique molecular counts” (UMCs). In some cases, determining a quantitative measure related to sequencing read coverage comprises normalizing UMCs by a metric related to library size to provide normalized UMCs (“normalized UMCs”). Exemplary methods are dividing the UMC of a genetic locus by the sum of all UMCs, or dividing the UMC of a genetic locus by the sum of all autosomal UMCs. When comparing multiple sequencing read data sets, UMCs can, for example, be normalized by the median UMCs of the genetic loci of the two sequencing read data sets. In some cases, the quantitative measure related to sequencing read coverage can be normalized UMCs that are further normalized as follows: (i) normalized UMCs are determined for corresponding genetic loci from sequencing reads derived from training samples; (ii) for each genetic locus, normalized UMCs of the sample are normalized by the median of the normalized UMCs of the training samples at the corresponding loci, thereby providing Relative Abundances (RAs) of genetic loci.
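The two-stage normalization described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a definitive implementation: it uses the sum of all UMCs as the library-size metric (one of the exemplary options), assumes every locus in the sample is also present in every training sample, and uses hypothetical function names and toy counts.

```python
from statistics import median


def normalize_umcs(umcs: dict) -> dict:
    """Divide each locus UMC by the sum of all UMCs in the data set."""
    total = sum(umcs.values())
    if total <= 0:
        raise ValueError("total UMC must be positive")
    return {locus: count / total for locus, count in umcs.items()}


def relative_abundances(sample_umcs: dict, training_umcs: list) -> dict:
    """Relative Abundance (RA) per locus.

    Step (i): normalized UMCs are computed for each training sample.
    Step (ii): the sample's normalized UMC at each locus is divided by the
    median of the training samples' normalized UMCs at that locus.
    """
    sample_norm = normalize_umcs(sample_umcs)
    training_norm = [normalize_umcs(t) for t in training_umcs]
    return {
        locus: value / median(t[locus] for t in training_norm)
        for locus, value in sample_norm.items()
    }


if __name__ == "__main__":
    sample = {"locus_A": 300, "locus_B": 100}          # toy counts
    training = [
        {"locus_A": 100, "locus_B": 100},
        {"locus_A": 200, "locus_B": 200},
        {"locus_A": 50, "locus_B": 50},
    ]
    print(relative_abundances(sample, training))
```

In this toy case the training samples place half of their molecules at each locus, so a sample with three quarters of its molecules at locus_A has an RA of 1.5 there and 0.5 at locus_B.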

Consensus sequences can be identified based on their sequences, for example by collapsing sequencing reads based on identical sequences within the first 5, 10, 15, 20, or 25 bases. In some cases, collapsing allows for 1 difference, 2 differences, 3 differences, 4 differences, or 5 differences in the reads that are otherwise identical. In some cases, collapsing uses the mapping position of the read, for example the mapping position of the initial base of the sequencing read. In some cases, collapsing uses barcodes, and sequencing reads that share barcode sequences are collapsed into a consensus sequence. In some cases, collapsing uses both barcodes and the sequence of the initial template molecules. For example, all reads that share a barcode and map to the same position in the reference genome can be collapsed. In another example, all reads that share a barcode and a sequence of the initial template molecule (or a percentage identity to a sequence of the initial template molecule) can be collapsed.
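One of the variants above, collapsing all reads that share a barcode and map to the same position, can be sketched as follows. This Python sketch is illustrative only: the read tuple layout, the per-position majority vote used to form the consensus, and the truncation to the shortest read in a family are assumptions, and the methods incorporated by reference may differ.

```python
from collections import Counter, defaultdict


def collapse_reads(reads):
    """Collapse reads into families keyed by (barcode, chromosome, start).

    reads: iterable of (barcode, chrom, start, sequence) tuples
           (a hypothetical layout chosen for this sketch).
    Returns a dict mapping each family key to a consensus sequence built
    by majority vote at each position, over the shortest read length.
    """
    families = defaultdict(list)
    for barcode, chrom, start, seq in reads:
        families[(barcode, chrom, start)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        length = min(len(s) for s in seqs)
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


if __name__ == "__main__":
    reads = [
        ("AAT", "chr12", 100, "ACGT"),
        ("AAT", "chr12", 100, "ACGT"),
        ("AAT", "chr12", 100, "ACCT"),  # one read carries a likely error
        ("GGC", "chr12", 100, "ACGT"),  # different barcode: separate family
    ]
    families = collapse_reads(reads)
    print(len(families), families)
```

The number of families returned for a locus corresponds to the count of distinct initial template molecules observed there, and the majority vote discards errors that appear in only a minority of a family's reads.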

In some cases, quantitative measures of sequencing read coverage are determined for specific sub-regions of a genome. Regions can be bins, genes of interest, exons, regions corresponding to sequence probes, regions corresponding to primer amplification products, or regions corresponding to primer binding sites. In some cases, sub-regions of the genome are regions corresponding to sequence capture probes. A read can map to a region corresponding to the sequence capture probe if at least a portion of the read maps to at least a portion of the region corresponding to the sequence capture probe. A read can map to a region corresponding to the sequence capture probe if at least a portion of the read maps to the majority of the region corresponding to the sequence capture probe. A read can map to a region corresponding to the sequence capture probe if at least a portion of the read maps across the center point of the region corresponding to the sequence capture probe. In some cases, a quantitative measure related to sequencing read coverage of a genetic locus is the median of the RAs of the probes corresponding to genomic locations within the genetic locus. For example, if KRAS is covered by three probes, which have RAs of 2, 3, and 5, the RA of the genetic locus would be 3.
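The center-point assignment rule and the KRAS example above can be written out as a short Python sketch. The half-open coordinate convention, the integer midpoint, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def covers_probe_center(read_start: int, read_end: int,
                        probe_start: int, probe_end: int) -> bool:
    """True if the read spans the center point of the probe region.

    Coordinates are assumed to be half-open intervals [start, end).
    """
    center = (probe_start + probe_end) // 2
    return read_start <= center < read_end


def locus_relative_abundance(probe_ras):
    """Median probe RA per genetic locus.

    probe_ras: iterable of (locus_name, probe_RA) pairs.
    """
    by_locus = defaultdict(list)
    for locus, ra in probe_ras:
        by_locus[locus].append(ra)
    return {locus: median(values) for locus, values in by_locus.items()}


if __name__ == "__main__":
    # The KRAS example from the text: three probes with RAs of 2, 3, and 5.
    print(locus_relative_abundance([("KRAS", 2), ("KRAS", 3), ("KRAS", 5)]))
    print(covers_probe_center(150, 300, 200, 320))
```

Using the median rather than the mean keeps a single outlying probe from dominating the locus-level value.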

“Saturation Equilibrium” Correction

In general, the methods described herein can be used to increase the specificity and sensitivity of variant calling (e.g., detecting copy number variants) in a nucleic acid sample. For example, the methods can decrease the amount of noise or distortion in a data sample, reducing the number of false positive variants detected. As noise and/or distortion decrease, specificity and sensitivity increase. Noise can be thought of as an unwanted random addition to a signal. Distortion can be thought of as an alteration in the amplitude of a signal or portion of a signal.

Noise can be introduced through errors in copying and/or reading a polynucleotide. For example, in a sequencing process, a single polynucleotide can first be subject to amplification. Amplification can introduce errors, so that a subset of the amplified polynucleotides may contain, at a particular locus, a base that is not the same as the original base at that locus. Furthermore, in the reading process a base at any particular locus may be read incorrectly. As a consequence, the collection of sequence reads can include a certain percentage of base calls at a locus that are not the same as the original base. In typical sequencing technologies this error rate can be in the single digits, e.g., 2%-3%. When a collection of molecules that are all presumed to have the same sequence are sequenced, this noise is sufficiently small that one can identify the original base with high reliability.

However, if a collection of parent polynucleotides includes a subset of polynucleotides having sequence variants at a particular locus, noise can be a significant problem. This can be the case, for example, when cell free DNA includes not only germline DNA, but DNA from another source, such as fetal DNA or DNA from a cancer cell. In this case, if the frequency of molecules with sequence variants is in the same range as the frequency of errors introduced by the sequencing process, then true sequence variants may not be distinguishable from noise. This could interfere, for example, with detecting sequence variants in a sample.

Distortion can be manifested in the sequencing process as a difference in signal strength, e.g., total number of sequence reads, produced by molecules in a parent population at the same frequency. Distortion can be introduced, for example, through amplification bias, GC bias, or sequencing bias. This could interfere with detecting copy number variation in a sample. GC bias results in the uneven representation of areas rich or poor in GC content in the sequence reading.

Methods disclosed herein comprise determining an initial set of genetic loci for use in determining a baseline by removing from a data set those genetic loci for which the quantitative measure related to sequencing read coverage or the transformed quantitative measure related to sequencing read coverage differs most from a predictive model (which can be referred to herein as removing high-variance genetic loci), thereby providing a first set of remaining genetic loci. In some instances, removing these genetic loci comprises fitting a model that relates the quantitative measures related to sequencing read coverage to the quantitative measures related to GC content of the genetic loci. For example, the predictive model can relate the RAs of the genetic loci to the GC content of the loci. In some cases, the predictive model is a regression model, including non-parametric regression models such as LOESS and LOWESS regression models. In some cases, baselining is performed by removing 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, or 70% of the genetic loci that deviate the most from the predictive model. In some cases, baselining is performed by removing at least 5%, at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, or at least 70% of the genetic loci that deviate the most from the predictive model. In some cases, deviation is determined by measuring the residuals of the genetic loci relative to the model. The exact cut-off can be chosen to exclude a specific amount of variance from the remaining genetic loci.
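The removal of high-variance genetic loci can be sketched as below. This is a minimal illustration under stated assumptions: an ordinary least-squares line stands in for the LOESS or LOWESS model named above, the names `fit_line` and `select_baseline_loci` are hypothetical, and the coverage values are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def select_baseline_loci(loci, drop_fraction=0.25):
    """Keep the loci whose coverage deviates least from the GC model.

    `loci` maps a locus name to (gc_content, coverage). The fraction of
    loci with the largest absolute residuals is removed.
    """
    names = list(loci)
    slope, intercept = fit_line(
        [loci[n][0] for n in names], [loci[n][1] for n in names]
    )
    residual = {
        n: abs(loci[n][1] - (slope * loci[n][0] + intercept)) for n in names
    }
    n_keep = len(names) - int(len(names) * drop_fraction)
    return sorted(names, key=residual.get)[:n_keep]


# Invented data: seven loci follow a mild GC trend, one is amplified.
loci = {
    "locus_1": (0.30, 1.00),
    "locus_2": (0.35, 1.02),
    "locus_3": (0.40, 1.04),
    "ERBB2": (0.45, 3.00),
    "locus_5": (0.50, 1.08),
    "locus_6": (0.55, 1.10),
    "locus_7": (0.60, 1.12),
    "locus_8": (0.65, 1.14),
}
baseline = select_baseline_loci(loci)
```

With a 25% cut-off, the two loci with the largest residuals are dropped, and the amplified locus is among them. A non-parametric smoother would be less influenced by the outlier itself than the straight line used here.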

Methods to determine a quantitative measure related to the probability that a strand of DNA from the sample derived from the genetic locus is represented within the sequencing reads are disclosed in PCT/US2014/072383, which is hereby incorporated by reference in its entirety. Determining the quantitative measure can comprise estimating the number of initial template DNA molecules derived from a locus that were present in the sample. The probability that a double-stranded polynucleotide generates no sequence reads can be determined based on the relative number of reads representing both strands of an initial template DNA molecule and reads representing only a single strand of an initial template DNA molecule.

The number of undetected initial template DNA molecules in a sample can be estimated based on the relative number of reads representing both strands of an initial template DNA molecule and reads representing only a single strand of an initial template DNA molecule. As an example, counts for a particular genetic locus, Locus A, are recorded, where 1000 molecules are paired (e.g., both strands are detected) and 1000 molecules are unpaired (e.g., only a single strand is detected). It should be noted that the terms “paired” and “unpaired” as used herein are distinct from these terms as sometimes applied to sequencing reads to indicate whether both ends or a single end of a molecule are sequenced. Assuming a uniform probability, p, for an individual Watson or Crick strand to make it through the process subsequent to conversion, one can calculate the proportion of molecules that fail to make it through the process (Unseen) as follows: R, the ratio of paired to unpaired molecules, is 1000/1000=1, therefore R=1=p²/(2p(1−p)). This implies that p=⅔ and that the proportion of lost molecules is equal to (1−p)²=1/9. Thus in this example, approximately 11% of converted molecules are lost and never detected. In addition to using the binomial distribution, other methods of estimating numbers of unseen molecules include exponential, beta, gamma or empirical distributions based on the redundancy of sequence reads observed. In the latter case, the distribution of read counts for paired and unpaired molecules can be derived from such redundancy to infer the underlying distribution of original polynucleotide molecules at a particular locus. This can often lead to a better estimation of the number of unseen molecules. In some cases, p is the quantitative measure related to the probability that a strand of DNA from the sample derived from the genetic locus is represented in the sequencing reads. In some cases, p is similarly derived, but a different model of read distribution is used (e.g., binomial, Poisson, beta, gamma, and negative binomial distribution).
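The Locus A arithmetic can be checked with a short sketch under the binomial assumption described above. Since R = p²/(2p(1−p)) simplifies to p/(2(1−p)), solving for p gives p = 2R/(1+2R). The function name `strand_recovery` and the returned estimate of the total number of converted molecules are illustrative additions.

```python
def strand_recovery(paired, unpaired):
    """Estimate the per-strand recovery probability p from counts of
    molecules seen with both strands (paired) or one strand (unpaired).

    Returns (p, unseen_fraction, estimated_total_molecules).
    """
    r = paired / unpaired            # R = p^2 / (2p(1 - p)) = p / (2(1 - p))
    p = 2 * r / (1 + 2 * r)          # solve the relation above for p
    unseen_fraction = (1 - p) ** 2   # both strands lost
    seen = paired + unpaired
    estimated_total = seen / (1 - unseen_fraction)
    return p, unseen_fraction, estimated_total


p, unseen_fraction, estimated_total = strand_recovery(1000, 1000)
```

For 1000 paired and 1000 unpaired molecules this yields p = 2/3, an unseen fraction of 1/9 (about 11%), and an estimated 2250 converted molecules, of which about 250 were never detected.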

A transformation for the quantitative measure related to sequencing read coverage can be determined by relating the quantitative measure or transformed quantitative measure related to sequencing read coverage from a set of genetic loci with high-variance genetic loci removed to the quantitative measure related to GC content and the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads. In some cases, the remaining genetic loci are assumed to be diploid and/or to be present at the same copy number. In some instances, a transformation is determined by fitting a measure related to central tendency of the quantitative measures related to sequencing read coverage of the remaining genetic loci by the quantitative measure related to GC content and the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads. A transformation can, for example, fit the central tendency of the quantitative measures of sequencing read coverage of the remaining genetic loci after removal of high-variance genetic loci by both the quantitative measures related to GC content and the quantitative measures related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads. In some instances, the measure related to central tendency of the quantitative measures of sequencing read coverage of the remaining loci is the central tendency of the UMCs of the remaining genetic loci. In some instances, a surface approximation is used to fit a surface of UMCs of the remaining genetic loci or the central tendency of the UMCs of the remaining genetic loci by (i) the quantitative measures related to GC content and (ii) the quantitative measures related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads. For example, the surface approximation can be a two-dimensional second-degree polynomial surface fit of the measure related to initial template DNA molecules (e.g., UMCs) by the quantitative measures of GC content and p. In some cases, the transformed quantitative measure related to sequencing coverage is the value expected based on the transformation determined above calculated from (i) the quantitative measures related to GC content and (ii) the quantitative measures related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads. In some cases, the transformed quantitative measure related to sequencing coverage is the residual of each genetic locus (e.g., the difference or quotient of the expected quantitative measure related to sequencing read coverage of a locus based on the surface approximation and the observed quantitative measure related to sequencing read coverage of the genetic locus in the sample). Optionally, after the transformed quantitative measure related to sequencing coverage is determined, high-variance genetic loci can again be removed as described above based on the new transformed quantitative measures of sequencing read coverage.
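A two-dimensional second-degree polynomial surface fit can be sketched as an ordinary least-squares problem over the six terms 1, GC, p, GC², GC·p, and p². The sketch below is an assumption-laden illustration: it solves the normal equations with a small hand-written Gaussian elimination so that it has no external dependencies, the function names are hypothetical, and the training surface is synthetic rather than measured.

```python
def design_row(gc, p):
    """Terms of a second-degree polynomial in GC content and p."""
    return [1.0, gc, p, gc * gc, gc * p, p * p]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_surface(gcs, ps, umcs):
    """Least-squares coefficients of UMC as a quadratic surface in (GC, p)."""
    rows = [design_row(g, p) for g, p in zip(gcs, ps)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, umcs)) for i in range(k)]
    return solve(ata, aty)


def expected_umc(coef, gc, p):
    return sum(c * t for c, t in zip(coef, design_row(gc, p)))


def true_surface(gc, p):
    # Synthetic surface used only to generate baseline training data.
    return 1000 + 2000 * gc + 1500 * p - 2500 * gc * gc + 500 * gc * p - 800 * p * p


grid = [(g / 10, q / 10) for g in range(3, 8) for q in range(4, 9)]
coef = fit_surface(
    [g for g, _ in grid], [q for _, q in grid], [true_surface(g, q) for g, q in grid]
)

# Quotient residual for a locus observed at 1.5x its expected UMC.
observed = 1.5 * true_surface(0.5, 0.6)
residual = observed / expected_umc(coef, 0.5, 0.6)
```

The quotient residual is the transformed measure in this sketch; a locus at baseline copy number would have a residual near 1.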

“Probe Efficiency” Correction

Disclosed herein are methods to determine and remove biases for genetic loci using reference samples. In some cases, the reference samples are sequencing reads from cell-free DNA from subjects without cancer. In some cases, the reference samples are sequencing reads from cell-free DNA from subjects with cancer cells that substantially lack copy number variation in the genetic loci of interest. In some cases, the reference samples are sequencing reads from cell-free DNA from subjects with cancer, where regions suspected of having undergone copy number variation are excluded from analysis. In some cases, the reference sample is a plasma sample from a subject without cancer. In some cases, the reference sample is a plasma sample from a subject with cancer.

Each of the genetic loci of the reference samples can be processed as described above in “saturation equilibrium correction” to provide transformed quantitative measures of sequencing read coverage. In some cases, the transformed quantitative measure related to sequencing coverage is the value expected based on the transformation determined above calculated from (i) the quantitative measures related to GC content and (ii) the quantitative measures related to the probability that a strand of DNA derived from the genetic locus from the reference genetic loci is represented within the sequencing reads. In some cases, the transformed quantitative measure related to sequencing coverage is the residual of each reference genetic locus (e.g., the difference or quotient of the expected quantitative measure related to sequencing read coverage of a locus based on the surface approximation and the observed quantitative measure related to sequencing read coverage of the genetic locus in the reference sample). The transformed quantitative measure related to sequencing read coverage of the genetic locus in the reference sample can be thought of as the “efficiency” of the genetic locus. For example, a genetic locus that is inefficiently amplified will have a lower UMC than a genetic locus (present at the same copy number in the sample) that is very efficiently amplified.

The transformed quantitative measure related to sequencing read coverage of the sample can be corrected based on the determined efficiency of the genetic loci from the reference sample(s). This correction can reduce variance introduced into the sample by the process of producing the sequencing reads from the sample, which can be related to ligation efficiency, pulldown efficiency, PCR efficiency, flow cell clustering loss, demultiplexing loss, collapsing loss, and alignment loss. In one embodiment, correction comprises dividing or subtracting the post-saturation transformed quantitative measures of sequencing coverage of the sample by the predicted post-saturation transformed quantitative measure related to sequencing coverage. In some instances, the predicted post-saturation transformed quantitative measure related to sequencing coverage of the genetic loci is determined by fitting a relationship between the post-saturation transformed quantitative measure related to sequencing coverage of the genetic loci from the sample and the post-saturation transformed quantitative measure related to sequencing read coverage of the references. In some cases, fitting comprises performing local regression (e.g., LOESS or LOWESS) or robust linear regression of the post-saturation transformed quantitative measure related to sequencing coverage of the genetic loci from the sample on the post-saturation transformed quantitative measure related to sequencing read coverage of the references. In some cases, the fitting can be linear regression, non-linear regression, or non-parametric regression.
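The division form of the correction can be sketched as follows, using plain linear regression as the fitted relationship. The names `fit_line` and `efficiency_correct`, the choice to fit only on baseline loci, and all numeric values are assumptions for illustration; a robust or local regression could be substituted for the straight-line fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def efficiency_correct(sample, reference, baseline):
    """Divide each locus's post-saturation sample signal by the signal
    predicted from the reference efficiency of that locus.

    The regression is fitted on the baseline loci only.
    """
    slope, intercept = fit_line(
        [reference[k] for k in baseline], [sample[k] for k in baseline]
    )
    return {k: sample[k] / (slope * reference[k] + intercept) for k in sample}


# Invented values: locus "e" is present at twice the baseline signal.
reference = {"a": 0.8, "b": 1.0, "c": 1.2, "d": 0.9, "e": 1.1}
sample = {"a": 0.8, "b": 1.0, "c": 1.2, "d": 0.9, "e": 2.2}
corrected = efficiency_correct(sample, reference, baseline=["a", "b", "c", "d"])
```

After correction the baseline loci sit near 1 regardless of their individual efficiencies, while the amplified locus stands out at about 2.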

Optionally, the transformed quantitative measure from the probe efficiency correction can be the input into the “saturation equilibrium correction” transformation to produce a third, further transformed quantitative measure related to sequencing read coverage with reduced variance. In general, transformed quantitative measures of sequencing coverage can be transformed using any of the methods disclosed herein additional times in order to further reduce the variance within the transformed quantitative measures of sequencing read coverage.

Gene Level Summaries

Gene level summaries of inferred copy number can be determined based on the transformed quantitative measures of sequencing read coverage determined as disclosed herein. Copy number can be inferred relative to the baseline selected in the above operations by discarding high-variance genetic loci. For example, if the remaining genetic loci are inferred to be diploid in the sample, then genetic loci for which the transformed quantitative measure related to sequencing coverage differs from the baseline can be inferred to have undergone copy number alteration in the tumor cells. In some instances, gene-level z-scores are calculated using the observed gene-level median of probe signal and an estimated standard deviation, which is calculated using the observed probe-level standard deviation estimate in a gene and the whole-genome normal diploid probe signal standard deviation.

Minor-Allele Frequency Baseline Optimization

Provided herein are methods to detect and correct errors in the gene level summaries of copy number described herein using minor allele frequencies of variants in the sequencing reads. Sequence variants present in between 10% and 90%, between 20% and 80%, between 30% and 70%, between 40% and 60%, or approximately 50% of sequencing reads from nucleic acids from a cell-free bodily fluid can be heterozygous variants present in the germline sequence of the subject. In some instances, genetic loci have been determined to have undergone amplification as described above. The quantities of variants are compared to the inferred copy number to determine if variant frequency is inconsistent with the inferred copy number. In one example, heterozygous genetic loci can be examined in the genetic loci that were used to determine the baseline copy number (e.g., the genetic loci remaining after exclusion of the high-variance genetic loci). In some cases, numerous genetic loci in the sample have been amplified, and this baseline can be misidentified. In such cases, heterozygosity may deviate from a 1:1 ratio, and the inaccurate baselining is detected and corrected. In a second example, a genetic locus can be inferred to be present at a triploid copy number based on the transformed quantitative measure related to sequencing read coverage. If the germline genome of the subject has one chromosome with a first allele of the genetic locus and a second chromosome has a second allele, then the first or second allele may have duplicated in the cancer cells.

Langmuir-Like Saturation Model

Without being bound by theory, disclosed herein is a Langmuir-like saturation model assumed to be the governing mechanism of bait-cfDNA interactions based on exploration of historical clinical data as well as targeted experiments involving synthetic spike-in model systems. Hence, in the absence of interfering assay effects (e.g., ligation efficiencies, PCR amplification biases, sequencing artifacts, etc.), the bait pulldown process may be described as

$\text{Unique molecule count} = I_{sat}\,\frac{K \cdot \text{CopyNumber}}{1 + K \cdot \text{CopyNumber}}$

K in this description is bait efficiency, which is dependent on bait sequence characteristics and its interactions with DNA fragments in the genomic vicinity of the targeted bait location. I_(sat) is a saturation parameter driven by the limited initial bait count in the pulldown reaction, which is a function of total bait pool concentration as well as replication count. Replication count as used herein refers to the relative or absolute amount of sequence capture probe present. For example, sequence capture arrays can provide for different molar quantities of probes on an array to account for differing probe efficiencies. FIG. 8 illustrates the model relating true copy number and unique molecule count based on bait efficiency, K, and saturation parameter I_(sat).
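The saturation behavior of this model, and its algebraic inversion, can be illustrated with a brief sketch. Rearranging the expression above gives CopyNumber = UMC / (K · (I_sat − UMC)). The function names and the parameter values K = 0.5 and I_sat = 4000 are arbitrary assumptions chosen for the example.

```python
def expected_umc(copy_number, k, i_sat):
    """Langmuir-like response: UMC = I_sat * K*CN / (1 + K*CN)."""
    x = k * copy_number
    return i_sat * x / (1 + x)


def inferred_copy_number(umc, k, i_sat):
    """Invert the model: CN = UMC / (K * (I_sat - UMC))."""
    if not 0 <= umc < i_sat:
        raise ValueError("UMC must lie in [0, I_sat) for the model to invert")
    return umc / (k * (i_sat - umc))


K, I_SAT = 0.5, 4000.0            # illustrative values only
umc_diploid = expected_umc(2, K, I_SAT)
umc_four_copies = expected_umc(4, K, I_SAT)
recovered = inferred_copy_number(umc_four_copies, K, I_SAT)
```

With these values, doubling the copy number from 2 to 4 raises the expected unique molecule count from 2000 to roughly 2667 rather than to 4000, which is the non-linear response the model is intended to capture.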

Bait efficiency K is largely driven by GC content, while I_(sat) is driven by more complex bait exhaustion mechanisms and RNA secondary structure interactions that can be crudely examined by studying unique molecule count vs. total read count interactions.

Aside from the non-linear pulldown reaction, probe signal can be further modeled by a simple multiplicative model involving the following assumption: under a naïve model, cfDNA fragments are uniformly distributed by genomic position, with the stochastic sampling process being the dominating factor contributing to coverage variation. Then the copy number signal (e.g., the UMC or read count associated with a given probe) can be modeled by relating the observed UMC to the true molecular count in the sample, taking into account the effects of the underlying positional cfDNA profile, ligation efficiency, pulldown efficiency, PCR efficiency, flow cell clustering loss, demultiplexing loss, collapsing loss, and alignment loss:

Observed UMC = True UMC × Underlying positional cfDNA profile(bait, cfDNA fragment) × Ligation efficiency(position, size, cfDNA fragment) × pulldown efficiency(probe, cfDNA fragment) × PCR efficiency(DNA fragment) × flow cell clustering loss × demultiplexing loss and collapsing loss × alignment loss(cfDNA fragment sequence).

This model assumes that the effects listed above combine multiplicatively. The underlying bait-specific copy number signal can be inferred from the observed UMC (e.g., the UMC of a given sequence capture probe) in relation to an established baseline by a series of steps, such as the baseline determination methods disclosed herein.
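One consequence of the multiplicative form is that probe-specific factors cancel when an observed UMC is expressed relative to a baseline measured with the same probe. The sketch below illustrates this; the factor names mirror the terms above, but every numeric value is invented, and the function name `observed_umc` is hypothetical.

```python
import math


def observed_umc(true_umc, factors):
    """Multiply the true molecule count by every loss or efficiency factor."""
    return true_umc * math.prod(factors.values())


# Invented per-probe factors, each between 0 and 1.
probe_factors = {
    "positional_profile": 0.90,
    "ligation_efficiency": 0.80,
    "pulldown_efficiency": 0.70,
    "pcr_efficiency": 0.95,
    "clustering_loss": 0.98,
    "demultiplexing_and_collapsing_loss": 0.97,
    "alignment_loss": 0.99,
}

baseline_signal = observed_umc(1000, probe_factors)   # baseline copy number
amplified_signal = observed_umc(1500, probe_factors)  # 1.5x as many molecules
relative_signal = amplified_signal / baseline_signal
```

The ratio recovers the 1.5-fold difference in true molecule count even though fewer than half of the molecules are observed in either case.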

Methods disclosed herein provide approaches for estimating probe efficiency and bait saturation from the sample and training sets. Alternately, such parameters may be inferred by performing a set of bait titration experiments, where the effect of varying target sequence concentration on UMCs is observed for each probe. If K, I_(sat), and UMC are known, it is then possible to determine a UMC value or range corresponding to tumor cells that have not undergone copy number variation. For example, under the assumption that most of the genetic loci have not undergone copy number alteration, the observed UMCs will largely be derived from diploid genetic loci. Genetic loci that have undergone copy number variation will be those for which the UMCs fall outside the expected range for probes with their corresponding values of K and I_(sat). In some cases, for example, the UMC value or range will be a function depending on K and I_(sat) for each probe. For example, the UMC corresponding to a diploid copy number can be different between two probes.

Computer Control Systems

The present disclosure provides computer control systems that are programmed to implement methods of the disclosure. FIG. 12 shows a computer system 1201 that is programmed or otherwise configured to implement methods of the present disclosure. The computer system 1201 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1205, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1201 also includes memory or memory location 1210 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1215 (e.g., hard disk), communication interface 1220 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1225, such as cache, other memory, data storage and/or electronic display adapters. The memory 1210, storage unit 1215, interface 1220 and peripheral devices 1225 are in communication with the CPU 1205 through a communication bus (solid lines), such as a motherboard. The storage unit 1215 can be a data storage unit (or data repository) for storing data. The computer system 1201 can be operatively coupled to a computer network (“network”) 1230 with the aid of the communication interface 1220. The network 1230 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 1230 in some cases is a telecommunication and/or data network. The network 1230 can include a local area network. The network 1230 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 1230, in some cases with the aid of the computer system 1201, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1201 to behave as a client or a server.

The CPU 1205 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1210. The instructions can be directed to the CPU 1205, which can subsequently program or otherwise configure the CPU 1205 to implement methods of the present disclosure. Examples of operations performed by the CPU 1205 can include fetch, decode, execute, and writeback.

The CPU 1205 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1201 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

The storage unit 1215 can store files, such as drivers, libraries and saved programs. The storage unit 1215 can store user data, e.g., user preferences and user programs. The computer system 1201 in some cases can include one or more additional data storage units that are external to the computer system 1201, such as located on a remote server that is in communication with the computer system 1201 through an intranet or the Internet.

The computer system 1201 can communicate with one or more remote computer systems through the network 1230. For instance, the computer system 1201 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1201 via the network 1230.

Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1201, such as, for example, on the memory 1210 or electronic storage unit 1215. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1205. In some cases, the code can be retrieved from the storage unit 1215 and stored on the memory 1210 for ready access by the processor 1205. In some situations, the electronic storage unit 1215 can be precluded, and machine-executable instructions are stored on memory 1210.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 1201, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 1201 can include or be in communication with an electronic display 1235 that comprises a user interface (UI) 1240 for providing, for example, a report. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.

Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1205.

EXAMPLES

Example 1

Examination of previously generated copy number variation spike-in data revealed significant probe-to-probe signal variation, both in raw read counts and UMCs, as well as in the probe/gene-level copy number signal response to underlying copy number changes. See FIG. 2. FIG. 3 illustrates the inferred versus theoretical copy number of three genes (CCND1, CCND2, and ERBB2), demonstrating the non-linear response of normalized coverage to the amount of bait in the sample. These results suggest bait depletion during pulldown, which was confirmed by following the bait titration effect in neighboring probes within the same gene with sizable differences in unique molecule count (thereby observing faster unique molecule count saturation for probes with high initial UMC).

FIG. 4A illustrates that the UMCs associated with each probe have a non-linear response with respect to probe p. FIG. 4B illustrates that UMCs associated with each probe have a non-linear response with respect to probe GC content.

FIG. 5 illustrates UMCs of probes without performing saturation or probe-efficiency correction. FIG. 6 shows the same sample after saturation correction. FIG. 7 shows the same sample after probe efficiency correction. The variance within genomic positions is reduced at each stage, leading to a clearer picture of the underlying copy number variation of the tumor cells emerging. Genes in FIG. 7 for which the median probe post-probe efficiency correction signal is above 1.2 are called as having undergone copy number variation. Differing levels of post-probe efficiency correction signal are likely due to tumor heterogeneity or secondary tumors.
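The gene-level calling rule described for FIG. 7 can be sketched as a threshold on the per-gene median of corrected probe signal. The function name `call_amplified` and the probe values below are invented for illustration; only the 1.2 threshold comes from the text.

```python
from statistics import median


def call_amplified(probe_signal_by_gene, threshold=1.2):
    """Return genes whose median corrected probe signal exceeds the threshold."""
    return sorted(
        gene
        for gene, signals in probe_signal_by_gene.items()
        if median(signals) > threshold
    )


# Invented post-probe-efficiency-correction signals per probe.
probe_signal_by_gene = {
    "ERBB2": [1.8, 2.1, 1.9],
    "KRAS": [1.00, 1.05, 0.95],
    "CCND1": [1.25, 1.15, 1.30],
}
called = call_amplified(probe_signal_by_gene)
```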

FIG. 9 shows the typical progression of baselining genetic loci probe signal noise reduction after saturation correction and probe efficiency correction.

FIG. 10A illustrates a plot of probe efficiency of the reference sample(s) on the x-axis and the sample's post-saturation corrected signal from a subject without copy number variation in tumor cells. The relationship is approximately linear. FIG. 10B illustrates a similar plot from a subject with copy number variation in tumor cells. The response is not as linear as FIG. 10A. Correcting by the predicted efficiency inferred by determining the relationship between the probe efficiency from the reference sample(s) and the post-saturation corrected UMCs of the baselining genetic loci (indicated in black) will reduce variation due to differing probe efficiencies in the genetic loci that have putatively undergone copy number amplification in tumor cells (dots in grey). FIG. 11 illustrates an exemplary report of copy number variation from a patient sample based on post-saturation and probe-efficiency corrected UMCs and MAF-optimized baselining. Stars indicate points that are indicated to belong to genetic loci that have undergone copy number variation in the tumor cells of the subject.

Example 2

Cell-free DNA is obtained from a subject with cancer, a barcoded sequencing library is prepared, a panel of oncogenes is enriched by sequence capture with a probe set, and the barcoded sequencing library is sequenced. The sequencing reads are mapped to a reference genome and collapsed into families based on their barcode sequences and mapping position. For each genomic coordinate corresponding to a midpoint of a probe from the probe set, the number of read families spanning that midpoint is counted to obtain a per-probe UMC. A median per-probe UMC is determined for each gene. To perform “saturation equilibrium correction,” the genes are grouped by their median per-probe GC content. Genes for which the median per-probe UMC differs significantly from those genes with similar median per-probe GC content are removed.
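As a minimal sketch of the counting step above, reads can be collapsed into families keyed on barcode and mapping position, and the families spanning each probe midpoint can then be counted. The dictionary-based read records, the half-open interval convention, and all function names are hypothetical choices for illustration only.

```python
from collections import defaultdict
from statistics import median


def collapse_into_families(reads):
    """Collapse reads sharing barcode and mapping position into one family each."""
    return {(r["barcode"], r["chrom"], r["start"], r["end"]) for r in reads}


def per_probe_umc(families, probe_midpoints):
    """Count read families spanning each probe midpoint.

    probe_midpoints: dict of probe id -> (chrom, position); intervals are
    treated as half-open [start, end), an assumption of this sketch.
    """
    counts = {}
    for probe_id, (chrom, pos) in probe_midpoints.items():
        counts[probe_id] = sum(
            1 for _, c, start, end in families if c == chrom and start <= pos < end
        )
    return counts


def median_umc_per_gene(counts, probe_to_gene):
    """Summarize per-probe UMCs as one median per gene."""
    by_gene = defaultdict(list)
    for probe_id, n in counts.items():
        by_gene[probe_to_gene[probe_id]].append(n)
    return {gene: median(values) for gene, values in by_gene.items()}
```

Duplicate reads with the same barcode and coordinates contribute a single family, so the per-probe count reflects unique molecules rather than raw read depth.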

For each probe, p and GC content are determined as described herein. The remaining genes from the previous step are used to perform a two-dimensional second-degree polynomial surface fit of the median gene-level UMC to probe p and GC content. The function relating p and GC content to an expected UMC is used to determine expected per-probe UMCs. Residuals are determined for the data set by dividing the observed per-probe UMCs by the expected per-probe UMCs. The residual UMCs of each probe are the transformed quantitative measures of sequencing coverage.
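The surface fit and residual step can be illustrated with the following sketch, which fits UMC ≈ c0 + c1·p + c2·GC + c3·p² + c4·p·GC + c5·GC² by least squares and divides observed by expected values. Solving the normal equations with a small Gaussian elimination routine is a choice made here to keep the example dependency-free; the function names are hypothetical, and the disclosure does not specify a fitting procedure.

```python
def _features(p, gc):
    """Second-degree polynomial terms in p and GC content."""
    return [1.0, p, gc, p * p, p * gc, gc * gc]


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_surface(ps, gcs, umcs):
    """Least-squares coefficients of the polynomial surface (normal equations)."""
    rows = [_features(p, gc) for p, gc in zip(ps, gcs)]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, umcs)) for i in range(k)]
    return _solve(ata, atb)


def expected_umc(coef, p, gc):
    """Expected UMC at a given p and GC content."""
    return sum(c * f for c, f in zip(coef, _features(p, gc)))


def residual_umc(coef, p, gc, observed):
    """Residual as defined in the text: observed UMC divided by expected UMC."""
    return observed / expected_umc(coef, p, gc)
```

In this sketch the surface would be fitted on the retained gene-level medians and then evaluated at each probe's own p and GC content, so that a probe behaving like the baseline has a residual near 1.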

Genes are again grouped by their median per-probe GC content, and genes whose median per-probe residual UMCs are significantly different from genes with similar median per-probe GC content are removed. “Probe efficiency” correction is then performed by obtaining residual UMCs of reference sample(s) as described in the preceding paragraphs. The residual UMCs of each probe from the sample are then divided by the residual UMCs of each corresponding probe from the reference(s) to obtain post-probe efficiency corrected UMCs.
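A minimal sketch of the probe efficiency division follows. Where more than one reference sample is available, the sketch takes the median reference residual per probe; that pooling rule, the dictionary inputs, and the function name are assumptions, since the text does not state how multiple references are combined.

```python
from statistics import median


def probe_efficiency_correct(sample_residuals, reference_residuals):
    """Divide each sample residual UMC by the matching reference residual UMC.

    sample_residuals: dict of probe id -> residual UMC for the test sample
    reference_residuals: list of such dicts, one per reference sample
    Probes missing from every reference, or with a non-positive reference
    value, are skipped (an assumption of this sketch).
    """
    corrected = {}
    for probe_id, value in sample_residuals.items():
        refs = [ref[probe_id] for ref in reference_residuals if probe_id in ref]
        if not refs:
            continue
        efficiency = median(refs)
        if efficiency > 0:
            corrected[probe_id] = value / efficiency
    return corrected
```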

Similar to saturation equilibrium correction above, the remaining genes are used to perform a two-dimensional second-degree polynomial surface fit of the post-probe efficiency corrected UMC to probe p and GC content. The function relating p and GC content to an expected post-probe efficiency corrected UMC is used to determine an expected per-probe post-probe efficiency corrected UMC. Residuals are determined for the data set by dividing the observed per-probe post-probe efficiency corrected UMC by the expected per-probe post-probe efficiency corrected UMC. The residual post-probe efficiency corrected UMCs of each probe are the post-probe GC-corrected signal.

The remaining genes are grouped by their median per-probe GC content, and genes whose median post-probe GC-corrected signal differs significantly from those genes with similar median per-probe GC content are removed.
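The grouping-and-removal step that recurs throughout this example can be sketched as follows. The text does not define "differs significantly," so the sketch assumes fixed-width GC bins and a median-absolute-deviation rule; the bin width, the multiplier, and the function name are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def filter_genes_by_gc_bin(gene_signal, gene_gc, bin_width=0.05, max_mads=3.0):
    """Keep genes whose signal is close to that of genes with similar GC content.

    gene_signal: dict of gene -> median per-probe signal
    gene_gc: dict of gene -> median per-probe GC fraction (0 to 1)
    A gene is removed when it lies more than max_mads median absolute
    deviations from the median of its GC bin. Bins with zero spread are
    left unfiltered.
    """
    bins = defaultdict(list)
    for gene in gene_signal:
        bins[int(gene_gc[gene] / bin_width)].append(gene)

    kept = []
    for genes in bins.values():
        values = [gene_signal[g] for g in genes]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        for g in genes:
            if mad == 0 or abs(gene_signal[g] - center) <= max_mads * mad:
                kept.append(g)
    return sorted(kept)
```

The genes returned by such a filter would serve as the baselining set for the next fit, while removed genes remain candidates for copy number variation.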

The process of the example is repeated, with the post-probe GC-corrected signal as the starting input instead of the initial UMC.

For each gene, the median of the post-probe GC-corrected signal is used to summarize each gene. Genes whose median post-probe GC-corrected signal is significantly different from that of the other genes are considered as candidates for having undergone gene amplification or deletion in the tumor cells.

For each gene, germline heterozygous alleles are determined and the relative frequency of each allele is quantified. Genetic loci used for baselining are found to have an approximately 1:1 ratio of alleles, validating the selection of baselining genetic loci.
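The allele-ratio check can be illustrated with the sketch below, which tests whether every heterozygous germline site in a locus has an allele fraction near 0.5. The tolerance value, the count-pair input format, and the function name are assumptions; the text only states that the ratio is approximately 1:1.

```python
def is_allele_balanced(allele_counts, tolerance=0.1):
    """Return True when every heterozygous site has an allele fraction near 0.5.

    allele_counts: list of (count_allele_a, count_allele_b) pairs, one per
    germline heterozygous site in the locus. Sites with no coverage are
    ignored. tolerance is a hypothetical threshold on |fraction - 0.5|.
    """
    for count_a, count_b in allele_counts:
        total = count_a + count_b
        if total == 0:
            continue
        if abs(count_a / total - 0.5) > tolerance:
            return False
    return True
```

A baselining locus that fails such a check would be suspected of copy number alteration and could be dropped from the baseline set.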

A Z-score is determined for each gene based on the gene-level median post-probe GC-corrected signals and estimated standard deviations from whole-genome normal diploid probe signals. Genes with Z-scores higher than a cut-off are reported as having undergone gene amplification in tumor cells.
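One way to read the Z-score step is sketched below: the gene-level median is centered on the mean of the diploid probe signals and scaled by their sample standard deviation. Using the probe-level standard deviation directly, the cut-off of 3.0, and the function names are assumptions made for illustration; the disclosure does not give the exact estimator or the cut-off value.

```python
from statistics import mean, median, stdev


def gene_z_scores(gene_probe_signal, diploid_probe_signal):
    """Z-score of each gene's median signal against normal diploid probe signals.

    gene_probe_signal: dict of gene -> list of post-probe GC-corrected signals
    diploid_probe_signal: list of signals from probes assumed to be diploid
    """
    center = mean(diploid_probe_signal)
    spread = stdev(diploid_probe_signal)
    return {
        gene: (median(values) - center) / spread
        for gene, values in gene_probe_signal.items()
    }


def call_amplified(z_scores, cutoff=3.0):
    """Report genes whose Z-score exceeds the (hypothetical) cut-off."""
    return sorted(gene for gene, z in z_scores.items() if z > cutoff)
```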

Example 3

The methods described herein were validated by measuring ERBB2 copy number with a method of the present disclosure and comparing the result against a control method. The method of the present disclosure produced a linear response of observed copy number (CN) vs. theoretical copy number, with no observed false positive CNV results in a normal (healthy) cohort. See FIG. 13, which shows the inferred gene copy number vs. the theoretical copy number estimate, with solid dots representing an observed copy number of ~2 (a diploid sample), open dots representing detected amplification events and the thick horizontal dashed line marking the mean gene CN cutoff. See also FIG. 14, which depicts the data of FIG. 13, with the control data represented by squares. All CNVs followed the expected titration trend down to 2.15 copies. Moreover, the method of the present disclosure decreased observed “noise” in the data due to a reduction in variance, allowing a CNV to be easily distinguished as compared to the control method. See the far right side of FIG. 15; triangles represent the method of the disclosure, while X's represent the control method.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

What is claimed is:
 1. A method, comprising: (a) obtaining sequencing reads of deoxyribonucleic acid (DNA) molecules of a cell-free bodily fluid sample of a subject; (b) generating from the sequencing reads a first data set comprising, for each genetic locus in a plurality of genetic loci, a quantitative measure related to sequencing read coverage (“read coverage”); (c) correcting the first data set by performing saturation equilibrium correction and probe efficiency correction; (d) determining a baseline read coverage for the first data set, wherein the baseline read coverage relates to saturation equilibrium and probe efficiency; and (e) determining a copy number state for each genetic locus in the plurality of genetic loci relative to the baseline read coverage.
 2. The method of claim 1, wherein the first data set comprises, for each genetic locus in a plurality of genetic loci, a quantitative measure related to guanine-cytosine content of the genetic locus (“GC content”).
 3. The method of claim 2, comprising prior to (c) removing from the first data set genetic loci that are high-variance genetic loci, wherein removing comprises: (i) fitting a model relating the quantitative measures related to guanine-cytosine content and the quantitative measures of sequencing read coverage of the genetic loci; and (ii) removing from the first data set at least 10% of the plurality of genetic loci, wherein removing the genetic loci comprises removing the at least 10% of the plurality of genetic loci that most differ from the model, thereby providing the first data set of baselining genetic loci.
 4. (canceled)
 5. The method of claim 3, wherein performing saturation equilibrium correction comprises transforming the first data set of baselining genetic loci into a saturation corrected data set by: (i) determining for each genetic locus from the first data set of baselining genetic loci a quantitative measure related to a probability that a strand of DNA molecule from the sample derived from the genetic locus is represented within the sequencing reads; (ii) determining a first transformation for the read coverage by relating the read coverage in the first data set of baselining genetic loci to both the GC content of the first data set of baselining genetic loci and the quantitative measure related to the probability that a strand of DNA derived from each genetic locus in the first data set of baselining genetic loci is represented within the sequencing reads; and (iii) applying the first transformation to the read coverage of each genetic locus from the first data set of baselining genetic loci to provide the saturation corrected data set, wherein the saturation corrected data set comprises a first set of transformed read coverages of the first data set of baselining genetic loci.
 6. The method of claim 5, wherein determining the first transformation comprises (i) determining a measure related to central tendency of the read coverage of the first data set of baselining genetic loci; (ii) determining a function that fits the measure related to central tendency of the read coverage of the first data set of baselining genetic loci based on the GC content of the genetic locus and the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads; and (iii) for each genetic locus of the first data set of baselining genetic loci, determining a difference between the read coverage predicted by the function and the read coverage, wherein the difference is the transformed read coverage.
 7. (canceled)
 8. (canceled)
 9. The method of claim 5, wherein performing probe efficiency correction comprises transforming the saturation corrected data set into a probe efficiency corrected data set by: (i) removing from the saturation corrected data set genetic loci that are high-variance genetic loci with respect to the first set of transformed read coverages, thereby providing a second data set of baselining genetic loci; (ii) determining a second transformation for the first set of transformed read coverages related to the probe efficiency of the second data set of baselining genetic loci; and (iii) transforming the first set of transformed read coverages of the second data set of baselining genetic loci with the second transformation, thereby providing the probe efficiency corrected data set, wherein the probe efficiency corrected data set comprises a second set of transformed read coverages of the second data set of baselining genetic loci.
 10. The method of claim 9, wherein removing from the first data set genetic loci that are high-variance genetic loci comprises: (i) fitting a model relating the GC content and the first set of transformed read coverages of the saturation corrected data set; and (ii) removing from saturation corrected data set at least 10% of the genetic loci, wherein the removing the genetic loci comprises removing genetic loci that most differ from the model, thereby providing the second data set of baselining genetic loci.
 11. (canceled)
 12. (canceled)
 13. (canceled)
 14. (canceled)
 15. (canceled)
 16. The method of claim 5, further comprising: (g) determining a third transformation for the second set of transformed read coverages by relating the transformed read coverages of the second data set of baselining genetic loci to both the GC content of the second data set of baselining genetic loci and the quantitative measure related to the probability that a strand of DNA derived from each locus in the second data set of baselining genetic loci is represented within the sequencing reads; (h) applying the third transformation to the second set of transformed read coverages to provide a fourth data set, wherein the fourth data set comprises a third set of transformed quantitative read coverages.
 17. The method of claim 1, wherein the DNA molecules of the cell-free bodily fluid sample are enriched for the plurality of genetic loci using one or more oligonucleotide probes that are complementary to at least a portion of the genetic loci from the plurality of genetic loci.
 18. The method of claim 17, wherein the GC content of each genetic locus from the plurality of genetic loci is a measure related to central tendency of guanine-cytosine content of the one or more oligonucleotide probes that are complementary to at least a portion of the genetic loci from the plurality of genetic loci.
 19. The method of claim 17, wherein the read coverage of the genetic locus is a measure related to central tendency of the read coverage of regions of the genetic locus corresponding to the one or more oligonucleotide probes.
 20. The method of claim 17, wherein the performing saturation equilibrium correction and the performing probe efficiency correction comprise fitting a Langmuir model, wherein the Langmuir model comprises probe efficiency (K) and saturation equilibrium constant (I_(sat)).
 21. (canceled)
 22. (canceled)
 23. (canceled)
 24. (canceled)
 25. (canceled)
 26. (canceled)
 27. (canceled)
 28. The method of claim 1, wherein obtaining the sequencing reads comprises ligating adaptors to the DNA molecules from the cell-free bodily fluid sample from the subject.
 29. The method of claim 28, wherein the DNA molecules comprise duplex DNA molecules and the adaptors are ligated to the duplex DNA molecules such that each adaptor differently tags complementary strands of the DNA molecule to provide tagged strands.
 30. The method of claim 29, wherein determining the quantitative measure related to the probability that a strand of DNA derived from the genetic locus is represented within the sequencing reads comprises sorting sequencing reads into paired reads and unpaired reads, wherein (i) each paired read corresponds to sequence reads generated from a first tagged strand and a second differently tagged complementary strand derived from a double-stranded polynucleotide molecule in said set, and (ii) each unpaired read represents a first tagged strand having no second differently tagged complementary strand derived from a double-stranded polynucleotide molecule represented among said sequence reads in said set of sequence reads.
 31. The method of claim 30, further comprising determining quantitative measures of (i) said paired reads and (ii) said unpaired reads that map to each of one or more genetic loci to determine a quantitative measure related to total double-stranded DNA molecules in said sample that map to each of said one or more genetic loci based on said quantitative measure related to paired reads and unpaired reads mapping to each locus.
 32. The method of claim 28, wherein the adaptors comprise barcode sequences.
 33. The method of claim 32, wherein determining the read coverage comprises collapsing the sequencing reads based on position of the mapping of the sequencing reads to the reference genome and the barcode sequences.
 34. (canceled)
 35. The method of claim 1, further comprising determining that at least a subset of the baselining genetic loci have undergone copy number alteration in the tumor cells of the subject by determining relative quantities of variants within the baselining genetic loci for which the germline genome of the subject is heterozygous.
 36. The method of claim 35, wherein the relative quantities of the variants are not approximately equal, and wherein the baselining genetic loci for which the relative quantities of the variants are not approximately equal are removed from the baselining genetic loci, thereby providing allelic-frequency corrected baselining genetic loci.
 37. (canceled)
 38. (canceled)
 39. (canceled)
 40. (canceled)
 41. (canceled)